Compare commits
No commits in common. "master" and "feature/add-model-profiles" have entirely different histories.
@@ -1,94 +0,0 @@
# Plan: Add user model profiles to manifest.yaml
# Date: 2025-06-15
# Author: Hermes Agent
# Status: DRAFT

## Context
The user has a collection of GGUF models on their Main PC (10.0.4.11, RTX 3090, 24GB VRAM).
The intelligence-router manifest needs a profile for each model, with researched llama.cpp parameters.
Parameters were sourced from r/LocalLLaMA, HuggingFace model cards, and community blog posts.

## Hardware constraints
- GPU: RTX 3090, 24GB VRAM
- All profiles use `n_gpu_layers: 999` (a value above any model's layer count, so every layer is offloaded)
- All profiles use `flash-attn: on`
- KV cache quantization (q8_0 or q4_0) to enable 64K+ context within 24GB VRAM
- `min_p` set to 0.0 across all profiles (community standard for these models)

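The KV-cache bullet can be made concrete with a rough size estimate. The sketch below assumes the standard GGML block layouts (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes) and a plain full-attention transformer. The model dimensions are hypothetical placeholders, not the real values for any model in this plan, and models with sliding-window or hybrid attention would need less.

```python
# Bytes per cached element for each cache type (GGML block layouts).
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values plus one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values plus one fp16 scale per block
}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim,
                 cache_type_k="f16", cache_type_v="f16"):
    """Estimate KV-cache size in GiB for a plain full-attention transformer."""
    elements_per_token = n_layers * n_kv_heads * head_dim
    k_bytes = n_ctx * elements_per_token * BYTES_PER_ELEMENT[cache_type_k]
    v_bytes = n_ctx * elements_per_token * BYTES_PER_ELEMENT[cache_type_v]
    return (k_bytes + v_bytes) / 2**30


# Hypothetical dimensions, chosen only to make the arithmetic easy to follow.
HYPOTHETICAL_MODEL = {"n_layers": 48, "n_kv_heads": 8, "head_dim": 128}

if __name__ == "__main__":
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_gib(65536, cache_type_k=cache_type,
                            cache_type_v=cache_type, **HYPOTHETICAL_MODEL)
        print(f"{cache_type}: {size:.3f} GiB at 64K context")
```

With these placeholder dimensions, a 64K context comes to 12 GiB at f16, about 6.4 GiB at q8_0, and about 3.4 GiB at q4_0, which illustrates why most of the 128K profiles drop to q4_0.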
## Models to add (excluding mmproj files)

### Qwen3.6-27B (1 file: Qwen3.6-27B-Q4_K_M.gguf, ~10.5 GB)
Recommended sampling per HF model card and Unsloth: temp 0.6 / top_p 0.95 / top_k 20

| # | Profile ID | Name | n_ctx | cache_k/v | temp | top_k | repeat_pen |
|---|-----------|------|-------|-----------|------|-------|------------|
| 1 | qwen36-27b-balanced-64k | Qwen3.6-27B Balanced 64K | 65536 | q8_0/q8_0 | 0.6 | 20 | 1.0 |
| 2 | qwen36-27b-thinking-64k | Qwen3.6-27B Thinking 64K | 65536 | q8_0/q8_0 | 1.0 | 20 | 1.0 |
| 3 | qwen36-27b-extended-128k | Qwen3.6-27B Extended 128K | 131072 | q4_0/q4_0 | 0.6 | 20 | 1.05 |

### Gemma 4 12B (2 files: Q6_K_XL ~8.5 GB, IQ4_XS ~5 GB)
Google official: temp 1.0 / top_p 0.95 / top_k 64

| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k |
|---|-----------|------|------|-------|-----------|------|-------|
| 4 | gemma4-12b-standard-q6-64k | Gemma4 12B Standard Q6 64K | Q6_K_XL | 65536 | q8_0/q8_0 | 1.0 | 64 |
| 5 | gemma4-12b-extended-q6-128k | Gemma4 12B Extended Q6 128K | Q6_K_XL | 131072 | q4_0/q4_0 | 1.0 | 64 |
| 6 | gemma4-12b-compact-iq4-64k | Gemma4 12B Compact IQ4 64K | IQ4_XS | 65536 | q8_0/q8_0 | 1.0 | 64 |
| 7 | gemma4-12b-compact-long-128k | Gemma4 12B Compact IQ4 128K | IQ4_XS | 131072 | q8_0/q8_0 | 1.0 | 64 |

### Gemma 4 26B-A4B (2 files: Q4_K_M ~10.5 GB, IQ4_XS ~6 GB)
MoE, 4B active. Same sampling as 12B family.

| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | repeat_pen |
|---|-----------|------|------|-------|-----------|------|-------|------------|
| 8 | gemma4-26b-balanced-64k | Gemma4 26B Balanced 64K | Q4_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | 1.0 |
| 9 | gemma4-26b-extended-128k | Gemma4 26B Extended 128K | Q4_K_M | 131072 | q4_0/q4_0 | 1.0 | 64 | 1.15 |
| 10 | gemma4-26b-ultra-long-iq4-256k | Gemma4 26B Ultra-Long IQ4 256K | IQ4_XS | 262144 | q4_0/q4_0 | 1.0 | 64 | 1.0 |

### Qwen3.6-35B-A3B (2 files: UD-Q4_K_M ~14 GB, MTP-UD-Q4_K_M ~16 GB)
**MTP note:** An Unsloth benchmark shows MTP is net-negative on a single 3090. The MTP profile is included anyway because the user has the file.

| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | MTP |
|---|-----------|------|------|-------|-----------|------|-------|-----|
| 11 | qwen36-35b-fast-64k | Qwen3.6-35B Fast 64K | UD-Q4 | 65536 | q8_0/q8_0 | 0.6 | 20 | no |
| 12 | qwen36-35b-thinking-64k | Qwen3.6-35B Thinking 64K | UD-Q4 | 65536 | q8_0/q8_0 | 1.0 | 20 | no |
| 13 | qwen36-35b-extended-128k | Qwen3.6-35B Extended 128K | UD-Q4 | 131072 | q4_0/q4_0 | 0.6 | 20 | no |
| 14 | qwen36-35b-mtp-128k | Qwen3.6-35B MTP 128K | MTP-UD-Q4 | 131072 | q8_0/q8_0 | 0.6 | 20 | yes (n=3) |

### Uncensored models (apply censored family params)

| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | Based on |
|---|-----------|------|------|-------|-----------|------|-------|----------|
| 15 | qwen36-35b-hauhau-aggressive-64k | Qwen3.6-35B HauhauCS Aggressive 64K | Uncensored-HauhauCS-Q4_K_P | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
| 16 | qwen36-35b-genesis-apex-64k | Qwen3.6-35B Genesis APEX 64K | Uncensored-Genesis-APEX | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
| 17 | qwen36-35b-genesis-mtp-apex-128k | Qwen3.6-35B Genesis MTP APEX 128K | Uncensored-Genesis-MTP-APEX | 131072 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B MTP |
| 18 | gemma4-26b-hauhau-balanced-64k | Gemma4 26B HauhauCS Balanced 64K | Uncensored-HauhauCS-Q5_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | Gemma4 26B balanced |

**Total: 18 profiles**

## Flag mapping (manifest → llama-server CLI)

Manifest flags use snake_case keys, which the sidecar passes to llama-server as `--flag value` according to this table:

| Manifest key | CLI flag | Type | Notes |
|-------------|----------|------|-------|
| n_gpu_layers | --n-gpu-layers | int | 999 = all |
| n_ctx | --ctx-size | int | context window |
| cache_type_k | --cache-type-k | str | q8_0, q4_0 |
| cache_type_v | --cache-type-v | str | q8_0, q4_0 |
| flash_attn | --flash-attn | bool | true/on |
| temp | --temp | float | sampling |
| top_p | --top-p | float | sampling |
| top_k | --top-k | int | sampling |
| repeat_penalty | --repeat-penalty | float | sampling |
| min_p | --min-p | float | 0.0 |
| spec_type | --spec-type | str | draft-mtp (only MTP profiles) |
| spec_draft_n_max | --spec-draft-n-max | int | 3 (only MTP profiles) |
| presence_penalty | --presence-penalty | float | 0.0 |

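A minimal sketch of how the mapping in this section could be applied, assuming the sidecar builds an argument list from a profile's `flags` dict. The `FLAG_MAP` entries are copied from the plan's table; the function name `build_args`, the rejection of unknown keys, and the rendering of booleans as `on`/`off` are illustrative assumptions, not the sidecar's actual code.

```python
# Manifest key -> llama-server CLI flag, as listed in the plan's mapping table.
FLAG_MAP = {
    "n_gpu_layers": "--n-gpu-layers",
    "n_ctx": "--ctx-size",
    "cache_type_k": "--cache-type-k",
    "cache_type_v": "--cache-type-v",
    "flash_attn": "--flash-attn",
    "temp": "--temp",
    "top_p": "--top-p",
    "top_k": "--top-k",
    "repeat_penalty": "--repeat-penalty",
    "min_p": "--min-p",
    "spec_type": "--spec-type",
    "spec_draft_n_max": "--spec-draft-n-max",
    "presence_penalty": "--presence-penalty",
}


def build_args(flags: dict) -> list[str]:
    """Translate a profile's flags into a flat CLI argument list (hypothetical helper)."""
    args = []
    for key, value in flags.items():
        if key not in FLAG_MAP:
            # Assumption: failing loudly is preferable to silently dropping a flag.
            raise KeyError(f"unknown manifest flag: {key}")
        if isinstance(value, bool):
            # Assumption: booleans are rendered as on/off. A YAML 1.1 loader such
            # as PyYAML reads an unquoted `on` as True, so this case does occur.
            value = "on" if value else "off"
        args += [FLAG_MAP[key], str(value)]
    return args


if __name__ == "__main__":
    print(build_args({"n_ctx": 65536, "cache_type_k": "q8_0", "flash_attn": True}))
```

Rejecting unknown keys would surface a mismatch early if a manifest mixes spellings such as `n_ctx` and `ctx-size`.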
## Actions
1. Create branch `feature/add-model-profiles` from master
2. Create issues on Gitea for each model family (4 issues: qwen27, gemma12b, gemma26b, qwen35b)
3. Update `deploy/manifest.yaml` with all 18 profiles
4. Update tests if flag structure requires it
5. Run tests, commit
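Step 4 leaves the test changes open. One check that may be worth having is sketched below with a hypothetical helper, `check_profiles`, which is not part of the repository. It assumes profiles are loaded as a list of dicts with `id` and `flags` keys, and it accepts either `n_ctx` or `ctx-size` for the context size. It reports duplicate IDs and IDs whose `-NNNk` suffix disagrees with the configured context size.

```python
import re


def check_profiles(profiles):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    seen_ids = set()
    for profile in profiles:
        profile_id = profile.get("id", "")
        if profile_id in seen_ids:
            problems.append(f"duplicate id: {profile_id}")
        seen_ids.add(profile_id)

        flags = profile.get("flags", {})
        n_ctx = flags.get("n_ctx", flags.get("ctx-size"))
        suffix = re.search(r"-(\d+)k$", profile_id)
        if suffix and n_ctx is not None:
            expected = int(suffix.group(1)) * 1024
            if int(n_ctx) != expected:
                problems.append(
                    f"{profile_id}: id implies n_ctx={expected}, manifest has {n_ctx}"
                )
    return problems


if __name__ == "__main__":
    sample = [
        {"id": "qwen36-27b-balanced-64k", "flags": {"n_ctx": 65536}},
        {"id": "example-model-256k", "flags": {"n_ctx": 131072}},  # made-up mismatch
    ]
    for problem in check_profiles(sample):
        print(problem)
```

A check like this could run as a unit test against the parsed manifest before the commit in step 5.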
@@ -12,7 +12,6 @@ EnvironmentFile=-/home/bigt/AI/llm/.env
 Environment=MANIFEST_PATH=/home/bigt/AI/llm/manifest.yaml
 Environment=SIDECAR_PORT=8080
 Environment=PATH=/home/bigt/AI/llm/venv/bin:/usr/local/bin:/usr/bin:/bin
-Environment=PYTHONUNBUFFERED=1
 
 # Use the sidecar's venv — install deps via deploy/README.md
 ExecStart=/home/bigt/AI/llm/venv/bin/uvicorn sidecar.app:app --host 0.0.0.0 --port 8080
@@ -11,88 +11,141 @@
   # All profiles use flash-attn: on, n-gpu-layers: 999 (offload all)
   # KV cache quantization (q8_0/q4_0) enables 64K+ context within 24GB VRAM
 
+  - id: qwen-3-8b
+    name: "Qwen 3 8B"
+    model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
+    flags:
+      n_ctx: 8192
+      n_gpu_layers: 35
+
+  - id: qwen-3-8b-long
+    name: "Qwen 3 8B (Long Context)"
+    model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
+    flags:
+      n_ctx: 32768
+      n_gpu_layers: 20
+
+  - id: llama-4-maverick
+    name: "Llama 4 Maverick"
+    model_path: "/home/bigt/AI/llm/llama4/llama4-maverick-q4.gguf"
+    flags:
+      n_ctx: 8192
+      n_gpu_layers: 35
+
   # --- Qwen3.6-27B (Q4_K_M, ~10.5 GB) ---
   # Sampling: temp 0.6/1.0, top_p 0.95, top_k 20
   - id: qwen36-27b-balanced-64k
     name: "Qwen3.6-27B Balanced 64K"
     model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
     flags:
-      ctx-size: 65536
-      n-gpu-layers: 999
+      n_ctx: 65536
+      n_gpu_layers: 999
       cache-type-k: q8_0
       cache-type-v: q8_0
       flash-attn: on
       temp: 0.6
-      top-p: 0.95
-      top-k: 20
+      top_p: 0.95
+      top_k: 20
       repeat-penalty: 1.0
-      min-p: 0.0
+      min_p: 0.0
       presence-penalty: 0.0
 
   - id: qwen36-27b-thinking-64k
     name: "Qwen3.6-27B Thinking 64K"
     model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
     flags:
-      ctx-size: 65536
-      n-gpu-layers: 999
+      n_ctx: 65536
+      n_gpu_layers: 999
       cache-type-k: q8_0
       cache-type-v: q8_0
       flash-attn: on
       temp: 1.0
-      top-p: 0.95
-      top-k: 20
+      top_p: 0.95
+      top_k: 20
       repeat-penalty: 1.0
-      min-p: 0.0
+      min_p: 0.0
       presence-penalty: 0.0
 
   - id: qwen36-27b-extended-128k
     name: "Qwen3.6-27B Extended 128K"
     model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
     flags:
-      ctx-size: 131072
-      n-gpu-layers: 999
+      n_ctx: 131072
+      n_gpu_layers: 999
       cache-type-k: q4_0
       cache-type-v: q4_0
       flash-attn: on
       temp: 0.6
-      top-p: 0.95
-      top-k: 20
+      top_p: 0.95
+      top_k: 20
       repeat-penalty: 1.05
-      min-p: 0.0
+      min_p: 0.0
       presence-penalty: 0.0
 
-  # --- Gemma 4 12B (Q6_K_XL ~8.5 GB) ---
+  # --- Gemma 4 12B (Q6_K_XL ~8.5 GB, IQ4_XS ~5 GB) ---
   # Sampling: temp 1.0, top_p 0.95, top_k 64 (Google official)
   - id: gemma4-12b-standard-q6-64k
     name: "Gemma4 12B Standard Q6 64K"
     model_path: "/home/bigt/AI/llm/gemma4/gemma-4-12b-it-UD-Q6_K_XL.gguf"
     flags:
-      ctx-size: 65536
-      n-gpu-layers: 999
+      n_ctx: 65536
+      n_gpu_layers: 999
       cache-type-k: q8_0
       cache-type-v: q8_0
       flash-attn: on
       temp: 1.0
-      top-p: 0.95
-      top-k: 64
+      top_p: 0.95
+      top_k: 64
       repeat-penalty: 1.0
-      min-p: 0.0
+      min_p: 0.0
       presence-penalty: 0.0
 
   - id: gemma4-12b-extended-q6-128k
     name: "Gemma4 12B Extended Q6 128K"
     model_path: "/home/bigt/AI/llm/gemma4/gemma-4-12b-it-UD-Q6_K_XL.gguf"
     flags:
-      ctx-size: 131072
-      n-gpu-layers: 999
+      n_ctx: 131072
+      n_gpu_layers: 999
       cache-type-k: q4_0
       cache-type-v: q4_0
       flash-attn: on
       temp: 1.0
-      top-p: 0.95
-      top-k: 64
+      top_p: 0.95
+      top_k: 64
       repeat-penalty: 1.0
-      min-p: 0.0
+      min_p: 0.0
+      presence-penalty: 0.0
+
+  - id: gemma4-12b-compact-iq4-64k
+    name: "Gemma4 12B Compact IQ4 64K"
+    model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
+    flags:
+      n_ctx: 65536
+      n_gpu_layers: 999
+      cache-type-k: q8_0
+      cache-type-v: q8_0
+      flash-attn: on
+      temp: 1.0
+      top_p: 0.95
+      top_k: 64
+      repeat-penalty: 1.0
+      min_p: 0.0
+      presence-penalty: 0.0
+
+  - id: gemma4-12b-compact-long-128k
+    name: "Gemma4 12B Compact IQ4 128K"
+    model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
+    flags:
+      n_ctx: 131072
+      n_gpu_layers: 999
+      cache-type-k: q8_0
+      cache-type-v: q8_0
+      flash-attn: on
+      temp: 1.0
+      top_p: 0.95
+      top_k: 64
+      repeat-penalty: 1.0
+      min_p: 0.0
       presence-penalty: 0.0
 
   # --- Gemma 4 26B-A4B (Q4_K_M ~10.5 GB, IQ4_XS ~6 GB) ---
@@ -101,97 +154,48 @@
     name: "Gemma4 26B Balanced 64K"
     model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf"
     flags:
-      ctx-size: 65536
-      n-gpu-layers: 999
+      n_ctx: 65536
+      n_gpu_layers: 999
       cache-type-k: q8_0
       cache-type-v: q8_0
       flash-attn: on
       temp: 1.0
-      top-p: 0.95
-      top-k: 64
+      top_p: 0.95
+      top_k: 64
       repeat-penalty: 1.0
-      min-p: 0.0
+      min_p: 0.0
       presence-penalty: 0.0
 
   - id: gemma4-26b-extended-128k
     name: "Gemma4 26B Extended 128K"
     model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf"
     flags:
-      ctx-size: 131072
-      n-gpu-layers: 999
+      n_ctx: 131072
+      n_gpu_layers: 999
       cache-type-k: q4_0
       cache-type-v: q4_0
       flash-attn: on
       temp: 1.0
-      top-p: 0.95
-      top-k: 64
+      top_p: 0.95
+      top_k: 64
       repeat-penalty: 1.15
-      min-p: 0.0
+      min_p: 0.0
       presence-penalty: 0.0
 
   - id: gemma4-26b-ultra-long-iq4-128k
     name: "Gemma4 26B Ultra-Long IQ4 128K"
     model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf"
     flags:
-      ctx-size: 131072
-      n-gpu-layers: 999
+      n_ctx: 131072
+      n_gpu_layers: 999
       cache-type-k: q4_0
       cache-type-v: q4_0
       flash-attn: on
       temp: 1.0
-      top-p: 0.95
-      top-k: 64
+      top_p: 0.95
+      top_k: 64
       repeat-penalty: 1.0
-      min-p: 0.0
+      min_p: 0.0
-      presence-penalty: 0.0
-
-  - id: gemma4-26b-q5-64k
-    name: "Gemma4 26B Q5 64K"
-    model_path: "/home/bigt/AI/llm/gemma4/google_gemma-4-26B-A4B-it-Q5_K_M.gguf"
-    flags:
-      ctx-size: 65536
-      n-gpu-layers: 999
-      cache-type-k: q8_0
-      cache-type-v: q8_0
-      flash-attn: on
-      temp: 1.0
-      top-p: 0.95
-      top-k: 64
-      repeat-penalty: 1.0
-      min-p: 0.0
-      presence-penalty: 0.0
-
-  # --- Gemma 4 26B Compact (IQ4_XS ~6 GB) ---
-  - id: gemma4-26b-compact-iq4-64k
-    name: "Gemma4 26B Compact IQ4 64K"
-    model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
-    flags:
-      ctx-size: 65536
-      n-gpu-layers: 999
-      cache-type-k: q8_0
-      cache-type-v: q8_0
-      flash-attn: on
-      temp: 1.0
-      top-p: 0.95
-      top-k: 64
-      repeat-penalty: 1.0
-      min-p: 0.0
-      presence-penalty: 0.0
-
-  - id: gemma4-26b-compact-long-128k
-    name: "Gemma4 26B Compact IQ4 128K"
-    model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
-    flags:
-      ctx-size: 131072
-      n-gpu-layers: 999
-      cache-type-k: q4_0
-      cache-type-v: q4_0
-      flash-attn: on
-      temp: 1.0
-      top-p: 0.95
-      top-k: 64
-      repeat-penalty: 1.0
-      min-p: 0.0
       presence-penalty: 0.0
 
   # --- Qwen3.6-35B-A3B (UD-Q4_K_M ~14 GB) ---
@@ -201,144 +205,95 @@
     name: "Qwen3.6-35B Fast 64K"
     model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
     flags:
-      ctx-size: 65536
-      n-gpu-layers: 999
+      n_ctx: 65536
+      n_gpu_layers: 999
       cache-type-k: q8_0
       cache-type-v: q8_0
       flash-attn: on
       temp: 0.6
-      top-p: 0.95
-      top-k: 20
+      top_p: 0.95
+      top_k: 20
       repeat-penalty: 1.0
-      min-p: 0.0
+      min_p: 0.0
       presence-penalty: 0.0
 
   - id: qwen36-35b-thinking-64k
     name: "Qwen3.6-35B Thinking 64K"
     model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
     flags:
-      ctx-size: 65536
-      n-gpu-layers: 999
+      n_ctx: 65536
+      n_gpu_layers: 999
       cache-type-k: q8_0
       cache-type-v: q8_0
       flash-attn: on
       temp: 1.0
-      top-p: 0.95
-      top-k: 20
+      top_p: 0.95
+      top_k: 20
       repeat-penalty: 1.0
-      min-p: 0.0
+      min_p: 0.0
       presence-penalty: 0.0
 
   - id: qwen36-35b-extended-128k
     name: "Qwen3.6-35B Extended 128K"
     model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
     flags:
-      ctx-size: 131072
-      n-gpu-layers: 999
+      n_ctx: 131072
+      n_gpu_layers: 999
       cache-type-k: q4_0
       cache-type-v: q4_0
       flash-attn: on
       temp: 0.6
-      top-p: 0.95
-      top-k: 20
+      top_p: 0.95
+      top_k: 20
       repeat-penalty: 1.0
-      min-p: 0.0
+      min_p: 0.0
       presence-penalty: 0.0
 
-  # --- Qwen3.6-35B-A3B MTP variant ---
-  - id: qwen36-35b-mtp-fast-64k
-    name: "Qwen3.6-35B MTP Fast 64K"
-    model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf"
-    flags:
-      ctx-size: 65536
-      n-gpu-layers: 999
-      cache-type-k: q8_0
-      cache-type-v: q8_0
-      flash-attn: on
-      temp: 0.6
-      top-p: 0.95
-      top-k: 20
-      repeat-penalty: 1.0
-      min-p: 0.0
-      presence-penalty: 0.0
-
-  - id: qwen36-35b-mtp-extended-128k
-    name: "Qwen3.6-35B MTP Extended 128K"
-    model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf"
-    flags:
-      ctx-size: 131072
-      n-gpu-layers: 999
-      cache-type-k: q4_0
-      cache-type-v: q4_0
-      flash-attn: on
-      temp: 0.6
-      top-p: 0.95
-      top-k: 20
-      repeat-penalty: 1.0
-      min-p: 0.0
-      presence-penalty: 0.0
-
-  # --- Uncensored models ---
+  # --- Uncensored models (apply censored family params) ---
   - id: qwen36-35b-hauhau-aggressive-64k
     name: "Qwen3.6-35B HauhauCS Aggressive 64K"
     model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf"
     flags:
-      ctx-size: 65536
-      n-gpu-layers: 999
+      n_ctx: 65536
+      n_gpu_layers: 999
       cache-type-k: q8_0
       cache-type-v: q8_0
       flash-attn: on
       temp: 0.6
-      top-p: 0.95
-      top-k: 20
+      top_p: 0.95
+      top_k: 20
       repeat-penalty: 1.0
-      min-p: 0.0
+      min_p: 0.0
       presence-penalty: 0.0
 
   - id: qwen36-35b-genesis-apex-64k
     name: "Qwen3.6-35B Genesis APEX 64K"
     model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-Genesis-APEX.gguf"
     flags:
-      ctx-size: 65536
-      n-gpu-layers: 999
+      n_ctx: 65536
+      n_gpu_layers: 999
       cache-type-k: q8_0
       cache-type-v: q8_0
       flash-attn: on
       temp: 0.6
-      top-p: 0.95
-      top-k: 20
+      top_p: 0.95
+      top_k: 20
       repeat-penalty: 1.0
-      min-p: 0.0
+      min_p: 0.0
-      presence-penalty: 0.0
-
-  - id: qwen36-35b-genesis-mtp-apex-64k
-    name: "Qwen3.6-35B Genesis MTP APEX 64K"
-    model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-Genesis-MTP-APEX.gguf"
-    flags:
-      ctx-size: 65536
-      n-gpu-layers: 999
-      cache-type-k: q8_0
-      cache-type-v: q8_0
-      flash-attn: on
-      temp: 0.6
-      top-p: 0.95
-      top-k: 20
-      repeat-penalty: 1.0
-      min-p: 0.0
       presence-penalty: 0.0
 
   - id: gemma4-26b-hauhau-balanced-64k
     name: "Gemma4 26B HauhauCS Balanced 64K"
     model_path: "/home/bigt/AI/llm/gemma4/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q5_K_M.gguf"
     flags:
-      ctx-size: 65536
-      n-gpu-layers: 999
+      n_ctx: 65536
+      n_gpu_layers: 999
       cache-type-k: q8_0
       cache-type-v: q8_0
       flash-attn: on
       temp: 1.0
-      top-p: 0.95
-      top-k: 64
+      top_p: 0.95
+      top_k: 64
       repeat-penalty: 1.0
-      min-p: 0.0
+      min_p: 0.0
       presence-penalty: 0.0
@@ -7,8 +7,8 @@ services:
     ports:
       - "9001:9000"
     environment:
-      - SIDECAR_URL=http://10.0.4.11:8080
-      - MAIN_PC_URL=http://10.0.4.11:8081/v1
+      - SIDECAR_URL=http://10.0.4.11:8081
+      - MAIN_PC_URL=http://10.0.4.11:8080/v1
       - FALLBACK_SLM_URL=http://10.0.4.200:8080/v1
       - OPENROUTER_API_KEY=${OPENROUTER_API_KEY:-}
     restart: unless-stopped
437
main.py
@@ -141,49 +141,6 @@ def complete_switch():
     _switching_event.set()
 
 
-async def _background_switch(requested_model: str):
-    """Run a model switch in the background.
-
-    The sidecar POST is awaited but the caller gets an immediate SSE stream
-    so Hermes Desktop doesn't timeout waiting for the first response.
-
-    Called via asyncio.create_task() so it runs concurrently with the
-    SSE stream being sent to the client.
-    """
-    try:
-        async with httpx.AsyncClient(timeout=120.0) as client:
-            switch_resp = await client.post(
-                f"{SIDECAR_URL}/models/switch",
-                json={"profile_id": requested_model},
-            )
-            switch_result = switch_resp.json()
-            if switch_result.get("status") == "ready":
-                print(
-                    f"SWITCH SUCCESS: profile={requested_model}",
-                    flush=True,
-                )
-            else:
-                circuit_record_failure()
-                print(
-                    f"SWITCH FAILED: profile={requested_model}, "
-                    f"status={switch_result.get('status')}, "
-                    f"message={switch_result.get('message', '(no message)')}",
-                    flush=True,
-                )
-    except Exception as e:
-        circuit_record_failure()
-        print(
-            f"SWITCH EXCEPTION: profile={requested_model}, "
-            f"error={type(e).__name__}: {e}",
-            flush=True,
-        )
-    finally:
-        # Signal all queued requests so they can proceed (and fall
-        # through to the fallback chain if the switch failed).
-        complete_switch()
-        drain_queue()
-
-
 # ─── App ─────────────────────────────────────────────────────────────────────
 @asynccontextmanager
 async def lifespan(app: FastAPI):
@@ -196,12 +153,6 @@ app = FastAPI(lifespan=lifespan)
 
 
 # ─── GET /v1/models — Issue #2 ──────────────────────────────────────────────
-@app.get("/v1")
-async def v1_root():
-    """OpenAI API root — return basic info for Hermes Desktop WebUI probe."""
-    return {"object": "list", "data": []}
-
-
 @app.get("/v1/models")
 async def get_models():
     """OpenAI-compatible /v1/models endpoint proxying to Sidecar."""
@@ -228,170 +179,6 @@ async def health():
     return {"status": "router_online"}
 
 
-# ─── Hermes Desktop Probe Endpoints ──────────────────────────────────────────
-# These endpoints are probed by Hermes Desktop to validate/identify the
-# provider before allowing model switching. Without them the desktop
-# returns 503 and refuses to switch models.
-
-@app.get("/v1/models/{model_id:path}")
-async def get_single_model(model_id: str):
-    """OpenAI-compatible single model query. Proxied via Sidecar model list."""
-    async with httpx.AsyncClient(timeout=5.0) as client:
-        try:
-            resp = await client.get(f"{SIDECAR_URL}/models/available")
-            profiles = resp.json()
-        except Exception:
-            return JSONResponse(
-                status_code=503,
-                content={"error": "Sidecar unavailable", "data": []},
-            )
-
-    for p in profiles:
-        if p.get("id") == model_id:
-            return {"id": p["id"], "object": "model", "owned_by": "sidecar"}
-    return JSONResponse(status_code=404, content={"error": "model not found", "id": model_id})
-
-
-@app.get("/api/tags")
-async def ollama_tags():
-    """Ollama-compatible model list for Hermes Desktop discovery."""
-    async with httpx.AsyncClient(timeout=5.0) as client:
-        try:
-            resp = await client.get(f"{SIDECAR_URL}/models/available")
-            profiles = resp.json()
-        except Exception:
-            return JSONResponse(content={"models": []})
-
-    models = []
-    for p in profiles:
-        models.append({
-            "name": p.get("id", ""),
-            "model": p.get("id", ""),
-            "modified_at": "2025-01-01T00:00:00Z",
-            "size": 0,
-            "digest": "",
-            "details": {"format": "gguf", "family": p.get("name", "llm")},
-        })
-    return {"models": models}
-
-
-@app.get("/api/show")
-async def ollama_show_get(model: str = ""):
-    """Ollama-compatible model info for Hermes Desktop discovery (GET variant).
-
-    Some Hermes Desktop versions probe /api/show via GET with a ?model= parameter.
-    """
-    return await _ollama_show_lookup(model)
-
-
-@app.post("/api/show")
-async def ollama_show_post(request: Request):
-    """Ollama-compatible model info for Hermes Desktop discovery (POST variant)."""
-    body = await request.body()
-    body_data = json.loads(body) if body else {}
-    model_name = body_data.get("model", "")
-    return await _ollama_show_lookup(model_name)
-
-
async def _ollama_show_lookup(model_name: str):
    """Shared logic for Ollama /api/show model info lookup.

    When model_name is empty string (Hermes Desktop probe with no model field),
    returns the currently-active profile's info so the desktop can determine
    the correct context size. Previously returned 404, causing Hermes Desktop
    to default to 256k context.
    """
    async with httpx.AsyncClient(timeout=5.0) as client:
        try:
            resp = await client.get(f"{SIDECAR_URL}/models/available")
            profiles = resp.json()
            status_resp = await client.get(f"{SIDECAR_URL}/models/status")
            status = status_resp.json()
        except Exception:
            return JSONResponse(status_code=404, content={"error": "model not found"})

    # If no model specified, return the currently-active profile's info
    active_id = status.get("active_profile")
    if not model_name and active_id:
        for p in profiles:
            if p.get("id") == active_id:
                flags = p.get("flags", {})
                ctx_size = str(flags.get("ctx-size", flags.get("n_ctx", "4096")))
                return {
                    "modelfile": "",
                    "parameters": f"num_ctx {ctx_size}",
                    "template": "",
                    "details": {
                        "format": "gguf",
                        "family": p.get("name", "llm"),
                        "parameter_size": ctx_size,
                    },
                    "model_info": {"id": p.get("id", "")},
                }

    for p in profiles:
        if p.get("id") == model_name:
            # Extract actual context size from the profile's flags
            flags = p.get("flags", {})
            ctx_size = str(flags.get("ctx-size", flags.get("n_ctx", "4096")))
            return {
                "modelfile": "",
                "parameters": f"num_ctx {ctx_size}",
                "template": "",
                "details": {
                    "format": "gguf",
                    "family": p.get("name", "llm"),
                    "parameter_size": ctx_size,
                },
                "model_info": {"id": p.get("id", "")},
            }
    return JSONResponse(status_code=404, content={"error": "model not found"})
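Both lookup branches in `_ollama_show_lookup` build the same response dict. The following is a minimal sketch of how that shared part could be factored out. The helper names `build_show_payload` and `select_profile` are hypothetical and do not exist in the router; the field layout simply mirrors the code in this diff, including `parameter_size` carrying the context size.

```python
from typing import Optional


def build_show_payload(profile: dict) -> dict:
    """Build an Ollama-style /api/show body from a sidecar profile (hypothetical helper)."""
    flags = profile.get("flags", {})
    # Prefer the "ctx-size" key, fall back to the "n_ctx" key, then to 4096.
    ctx_size = str(flags.get("ctx-size", flags.get("n_ctx", "4096")))
    return {
        "modelfile": "",
        "parameters": f"num_ctx {ctx_size}",
        "template": "",
        "details": {
            "format": "gguf",
            "family": profile.get("name", "llm"),
            "parameter_size": ctx_size,
        },
        "model_info": {"id": profile.get("id", "")},
    }


def select_profile(profiles: list, model_name: str, active_id: Optional[str]) -> Optional[dict]:
    """Pick the requested profile, or the active one when no model is named."""
    wanted = model_name or active_id
    if not wanted:
        return None
    # First profile whose id matches, or None when nothing matches.
    return next((p for p in profiles if p.get("id") == wanted), None)
```

With these two pieces the endpoint body would reduce to one `select_profile` call followed by either `build_show_payload` or the 404 response.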
|
|
||||||
|
|
||||||
@app.get("/api/v1/models")
async def ollama_v1_models():
    """Ollama /api/v1/models redirect — return same list as /v1/models."""
    return await get_models()


@app.get("/v1/props")
async def llama_cpp_props():
    """llama.cpp discovery endpoint for Hermes Desktop."""
    async with httpx.AsyncClient(timeout=3.0) as client:
        try:
            resp = await client.get(f"{SIDECAR_URL}/models/status")
            status = resp.json()
        except Exception:
            status = {"active_profile": None, "llama_server_running": False}

    # Report the currently-running server version / capabilities
    return {
        "props": {
            "version": 1,
            "total_slots": 1,
            "chat_endpoint": "/v1/chat/completions",
            "completion_endpoint": "/v1/completions",
            "embedding_endpoint": "/v1/embeddings",
            "rerank_endpoint": "",
            "health_endpoint": "/health",
        },
        "active_profile": status.get("active_profile"),
        "server_running": status.get("llama_server_running", False),
    }


@app.get("/props")
async def llm_props():
    """Legacy llama.cpp discovery endpoint (same as /v1/props)."""
    return await llama_cpp_props()


@app.get("/version")
async def llm_version():
    """llama.cpp version endpoint for Hermes Desktop."""
    return {"version": "0.2.0", "build": "router-proxy", "commit": "intelligence-router"}
|
|
||||||
|
|
||||||
# ─── GET /models/status ──────────────────────────────────────────────────────
|
# ─── GET /models/status ──────────────────────────────────────────────────────
|
||||||
@app.get("/models/status")
|
@app.get("/models/status")
|
||||||
async def router_model_status():
|
async def router_model_status():
|
||||||
@@ -471,138 +258,96 @@ async def proxy(
|
|||||||
# ── Determine target URL ──────────────────────────────────────────────
|
# ── Determine target URL ──────────────────────────────────────────────
|
||||||
target_url: Optional[str] = None
|
target_url: Optional[str] = None
|
||||||
error: Optional[str] = None
|
error: Optional[str] = None
|
||||||
sidecar_status = None
|
|
||||||
|
|
||||||
# Always query the sidecar first (to detect recovery even when circuit is open)
|
# Circuit breaker check
|
||||||
async with httpx.AsyncClient(timeout=3.0) as client:
|
if not await circuit_breaker_check():
|
||||||
try:
|
|
||||||
resp = await client.get(f"{SIDECAR_URL}/models/status")
|
|
||||||
if resp.status_code == 200:
|
|
||||||
sidecar_status = resp.json()
|
|
||||||
circuit_reset()
|
|
||||||
except Exception:
|
|
||||||
pass # Handled below
|
|
||||||
|
|
||||||
if sidecar_status is None:
|
|
||||||
circuit_record_failure()
|
|
||||||
error = "sidecar_down"
|
|
||||||
elif not await circuit_breaker_check():
|
|
||||||
# Sidecar is up but circuit is open from prior switch failures
|
|
||||||
# Only block the switch — allow routing to already-active backend
|
|
||||||
error = "circuit_open"
|
error = "circuit_open"
|
||||||
if sidecar_status.get("llama_server_running"):
|
|
||||||
target_url = f"{MAIN_PC_BASE}/{path}"
|
|
||||||
else:
|
else:
|
||||||
# Both sidecar reachable and circuit closed — proceed normally
|
# Query Sidecar for active model
|
||||||
body = await request.body()
|
sidecar_status = None
|
||||||
body_data = json.loads(body) if body else {}
|
async with httpx.AsyncClient(timeout=3.0) as client:
|
||||||
requested_model = body_data.get("model")
|
|
||||||
|
|
||||||
# Only trigger model switches for actual chat/completion POST requests.
|
|
||||||
# GET probes, /api/show lookups, and other non-chat endpoints should
|
|
||||||
# never trigger a switch — they just read current state.
|
|
||||||
is_chat_request = (
|
|
||||||
request.method == "POST"
|
|
||||||
and path in ("v1/chat/completions", "v1/completions")
|
|
||||||
)
|
|
||||||
|
|
||||||
if requested_model and sidecar_status.get("active_profile") == requested_model:
|
|
||||||
target_url = f"{MAIN_PC_BASE}/{path}"
|
|
||||||
elif requested_model and is_chat_request:
|
|
||||||
# All requests during a model switch get an immediate SSE streaming
|
|
||||||
# response so clients (Hermes Desktop) do not time out while waiting
|
|
||||||
# for the model to load (10-30s). The switch runs in a background
|
|
||||||
# task; the SSE stream yields progress events, then pipes through
|
|
||||||
# the actual response once the backend model is ready.
|
|
||||||
current_switch = await wait_for_switch()
|
|
||||||
if current_switch is None:
|
|
||||||
# No switch in progress — start one in the background
|
|
||||||
await start_switch()
|
|
||||||
asyncio.create_task(_background_switch(requested_model))
|
|
||||||
|
|
||||||
# Queue this request — signals when switch completes
|
|
||||||
try:
|
try:
|
||||||
wait_evt = await queue_request()
|
resp = await client.get(f"{SIDECAR_URL}/models/status")
|
||||||
except HTTPException as he:
|
if resp.status_code == 200:
|
||||||
raise
|
sidecar_status = resp.json()
|
||||||
|
circuit_reset()
|
||||||
# Build request headers once
|
except Exception:
|
||||||
req_headers = dict(request.headers)
|
error = "sidecar_down"
|
||||||
req_headers.pop("host", None)
|
|
||||||
|
|
||||||
async def stream_with_sse():
|
|
||||||
sse_gen = sse_progress_stream(wait_evt)
|
|
||||||
try:
|
|
||||||
await wait_evt.wait()
|
|
||||||
async for sse_chunk in sse_gen:
|
|
||||||
yield sse_chunk
|
|
||||||
# Send actual request to Main PC
|
|
||||||
async with httpx.AsyncClient(timeout=60.0) as c:
|
|
||||||
async with c.stream(
|
|
||||||
request.method,
|
|
||||||
f"{MAIN_PC_BASE}/{path}",
|
|
||||||
content=body,
|
|
||||||
headers=req_headers,
|
|
||||||
) as resp:
|
|
||||||
async for chunk in resp.aiter_bytes():
|
|
||||||
yield chunk
|
|
||||||
except Exception:
|
|
||||||
# Main PC unreachable (switch failed or server died) —
|
|
||||||
# try fallback chain
|
|
||||||
yield _sse_format(
|
|
||||||
"error",
|
|
||||||
{"message": "Backend unreachable, trying fallback..."},
|
|
||||||
)
|
|
||||||
# Try OpenRouter
|
|
||||||
if OPENROUTER_API_KEY:
|
|
||||||
try:
|
|
||||||
fb_headers = dict(req_headers)
|
|
||||||
fb_headers["Authorization"] = f"Bearer {OPENROUTER_API_KEY}"
|
|
||||||
async with httpx.AsyncClient(timeout=60.0) as c:
|
|
||||||
async with c.stream(
|
|
||||||
request.method,
|
|
||||||
f"{OPENROUTER_BASE}/{path}",
|
|
||||||
content=body,
|
|
||||||
headers=fb_headers,
|
|
||||||
) as resp:
|
|
||||||
async for chunk in resp.aiter_bytes():
|
|
||||||
yield chunk
|
|
||||||
return
|
|
||||||
except Exception:
|
|
||||||
pass
|
|
||||||
# Fallback to LXC SLM
|
|
||||||
try:
|
|
||||||
async with httpx.AsyncClient(timeout=60.0) as c:
|
|
||||||
async with c.stream(
|
|
||||||
request.method,
|
|
||||||
f"{FALLBACK_SLM_URL}/{path}",
|
|
||||||
content=body,
|
|
||||||
headers=req_headers,
|
|
||||||
) as resp:
|
|
||||||
async for chunk in resp.aiter_bytes():
|
|
||||||
yield chunk
|
|
||||||
except Exception:
|
|
||||||
yield _sse_format(
|
|
||||||
"error",
|
|
||||||
{"message": "All backends unavailable"},
|
|
||||||
)
|
|
||||||
finally:
|
|
||||||
try:
|
|
||||||
await sse_gen.aclose()
|
|
||||||
except Exception:
|
|
||||||
pass
|
|
||||||
|
|
||||||
return StreamingResponse(
|
|
||||||
stream_with_sse(),
|
|
||||||
media_type="text/event-stream",
|
|
||||||
)
|
|
||||||
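The `stream_with_sse` generator above tries the Main PC first, then OpenRouter, then the LXC SLM, each in its own nested `try`. The same ordering can be expressed as a loop over backends. This is only a sketch under assumptions: `stream_with_fallback` is a hypothetical name, each backend is modelled as a zero-argument callable returning an async iterator of bytes, and the final error frame is an assumed SSE layout because the output of `_sse_format` is not shown in this diff.

```python
import asyncio
from typing import AsyncIterator, Callable, Iterable

# A backend is any callable that opens a byte stream when invoked.
Backend = Callable[[], AsyncIterator[bytes]]


async def stream_with_fallback(backends: Iterable[Backend]) -> AsyncIterator[bytes]:
    """Yield chunks from the first backend that streams without raising."""
    for open_stream in backends:
        try:
            async for chunk in open_stream():
                yield chunk
            return  # this backend finished cleanly, so stop here
        except Exception:
            continue  # move on to the next backend in the chain
    # Every backend raised: emit a single error frame (assumed format).
    yield b'event: error\ndata: {"message": "All backends unavailable"}\n\n'
```

As in the code above, a backend that fails after it has already produced some chunks leaves those chunks in the client's stream before the next backend starts.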
|
|
||||||
|
if sidecar_status is None:
|
||||||
|
circuit_record_failure()
|
||||||
|
error = "sidecar_down"
|
||||||
else:
|
else:
|
||||||
# No model in request body (probe/GET/non-chat request) —
|
# Extract requested model from request body
|
||||||
# route to the currently active backend when available,
|
body = await request.body()
|
||||||
# or fall through to the fallback chain.
|
body_data = json.loads(body) if body else {}
|
||||||
if sidecar_status.get("active_profile") and sidecar_status.get("llama_server_running"):
|
requested_model = body_data.get("model")
|
||||||
|
|
||||||
|
if requested_model and sidecar_status.get("active_profile") == requested_model:
|
||||||
target_url = f"{MAIN_PC_BASE}/{path}"
|
target_url = f"{MAIN_PC_BASE}/{path}"
|
||||||
|
else:
|
||||||
|
# Trigger switch
|
||||||
|
if requested_model:
|
||||||
|
# Check if a switch is already in progress
|
||||||
|
current_switch = await wait_for_switch()
|
||||||
|
|
||||||
|
if current_switch is not None and not current_switch.is_set():
|
||||||
|
# Another request started the switch — queue this one
|
||||||
|
try:
|
||||||
|
wait_evt = await queue_request()
|
||||||
|
except HTTPException as he:
|
||||||
|
raise
|
||||||
|
|
||||||
|
# SSE progress while waiting
|
||||||
|
async def stream_with_sse():
|
||||||
|
sse_gen = sse_progress_stream(wait_evt)
|
||||||
|
try:
|
||||||
|
await wait_evt.wait()
|
||||||
|
async for sse_chunk in sse_gen:
|
||||||
|
yield sse_chunk
|
||||||
|
complete_switch()
|
||||||
|
drain_queue()
|
||||||
|
async with httpx.AsyncClient(timeout=60.0) as c:
|
||||||
|
req_headers = dict(request.headers)
|
||||||
|
req_headers.pop("host", None)
|
||||||
|
async with c.stream(
|
||||||
|
request.method,
|
||||||
|
f"{MAIN_PC_BASE}/{path}",
|
||||||
|
content=body,
|
||||||
|
headers=req_headers,
|
||||||
|
) as resp:
|
||||||
|
async for chunk in resp.aiter_bytes():
|
||||||
|
yield chunk
|
||||||
|
finally:
|
||||||
|
# Clean up sse_gen
|
||||||
|
try:
|
||||||
|
await sse_gen.aclose()
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
return StreamingResponse(
|
||||||
|
stream_with_sse(),
|
||||||
|
media_type="text/event-stream",
|
||||||
|
)
|
||||||
|
|
||||||
|
# First request triggers the switch
|
||||||
|
await start_switch() # Create event for tracking
|
||||||
|
try:
|
||||||
|
async with httpx.AsyncClient(timeout=120.0) as client:
|
||||||
|
switch_resp = await client.post(
|
||||||
|
f"{SIDECAR_URL}/models/switch",
|
||||||
|
json={"profile_id": requested_model},
|
||||||
|
)
|
||||||
|
switch_result = switch_resp.json()
|
||||||
|
if switch_result.get("status") == "ready":
|
||||||
|
complete_switch()
|
||||||
|
drain_queue()
|
||||||
|
target_url = f"{MAIN_PC_BASE}/{path}"
|
||||||
|
else:
|
||||||
|
error = "switch_failed"
|
||||||
|
except Exception as e:
|
||||||
|
circuit_record_failure()
|
||||||
|
error = f"switch_error: {str(e)}"
|
||||||
|
|
||||||
# ── Fallback chain ────────────────────────────────────────────────────
|
# ── Fallback chain ────────────────────────────────────────────────────
|
||||||
if target_url is None:
|
if target_url is None:
|
||||||
@@ -633,11 +378,8 @@ async def proxy(
|
|||||||
request.method, target,
|
request.method, target,
|
||||||
content=body, headers=headers,
|
content=body, headers=headers,
|
||||||
) as resp:
|
) as resp:
|
||||||
if resp.status_code != 200:
|
|
||||||
print(f"PROXY: {target} returned {resp.status_code} during SSE stream", flush=True)
|
|
||||||
async for chunk in resp.aiter_bytes():
|
async for chunk in resp.aiter_bytes():
|
||||||
yield chunk
|
yield chunk
|
||||||
|
|
||||||
return StreamingResponse(gen(), status_code=200)
|
return StreamingResponse(gen(), status_code=200)
|
||||||
|
|
||||||
resp = await client.request(
|
resp = await client.request(
|
||||||
@@ -646,12 +388,6 @@ async def proxy(
|
|||||||
content=body,
|
content=body,
|
||||||
headers=headers,
|
headers=headers,
|
||||||
)
|
)
|
||||||
if resp.status_code != 200:
|
|
||||||
body_preview = resp.content[:500].decode("utf-8", errors="replace")
|
|
||||||
print(
|
|
||||||
f"PROXY: {request.method} {target} returned {resp.status_code}: {body_preview}",
|
|
||||||
flush=True,
|
|
||||||
)
|
|
||||||
return Response(
|
return Response(
|
||||||
content=resp.content,
|
content=resp.content,
|
||||||
status_code=resp.status_code,
|
status_code=resp.status_code,
|
||||||
@@ -661,11 +397,8 @@ async def proxy(
|
|||||||
primary_result = None
|
primary_result = None
|
||||||
try:
|
try:
|
||||||
primary_result = await execute(target_url)
|
primary_result = await execute(target_url)
|
||||||
except Exception as e:
|
except Exception:
|
||||||
print(
|
pass # Falls through to fallback chain
|
||||||
f"PROXY EXCEPTION on primary {target_url}: {type(e).__name__}: {e}",
|
|
||||||
flush=True,
|
|
||||||
) # Falls through to fallback chain
|
|
||||||
if primary_result is not None:
|
if primary_result is not None:
|
||||||
return primary_result
|
return primary_result
|
||||||
|
|
||||||
|
|||||||
@@ -1,161 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Sync intelligence-router model list into Hermes custom_providers.
|
|
||||||
|
|
||||||
Usage:
|
|
||||||
# One-shot: discover models from the router and update Hermes config
|
|
||||||
python3 scripts/sync_models.py
|
|
||||||
|
|
||||||
# Cron mode (auto): set up via:
|
|
||||||
# cp scripts/sync_models.py ~/.hermes/scripts/
|
|
||||||
# hermes cron create --schedule "every 30m" --no-agent --script sync_models.py
|
|
||||||
|
|
||||||
Silent exit when nothing changed. Prints a summary + restarts the gateway when
|
|
||||||
the model list differs.
|
|
||||||
"""
|
|
||||||
|
|
||||||
import json
|
|
||||||
import os
|
|
||||||
import subprocess
|
|
||||||
import sys
|
|
||||||
import urllib.error
|
|
||||||
import urllib.request
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
# ── CONFIGURE THESE ──────────────────────────────────────────────────
|
|
||||||
ROUTER_BASE_URL = "http://10.0.4.100:9001/v1"
|
|
||||||
PROVIDER_NAME = "intelligence_router"
|
|
||||||
GATEWAY_SERVICE = "hermes-gateway"
|
|
||||||
# ─────────────────────────────────────────────────────────────────────
|
|
||||||
|
|
||||||
MODELS_URL = f"{ROUTER_BASE_URL}/models"
|
|
||||||
CONFIG_PATH = Path(os.path.expanduser("~/.hermes/config.yaml"))
|
|
||||||
|
|
||||||
|
|
||||||
def fetch_models() -> list[str] | None:
    try:
        req = urllib.request.Request(MODELS_URL, headers={"Accept": "application/json"})
        with urllib.request.urlopen(req, timeout=10) as resp:
            data = json.loads(resp.read().decode())
        models = sorted(m["id"] for m in data.get("data", []) if isinstance(m, dict))
        return models if models else None
    except (urllib.error.URLError, TimeoutError, json.JSONDecodeError, OSError) as e:
        print(f"ERROR: Failed to fetch models from {MODELS_URL}: {e}", file=sys.stderr)
        return None
|
|
||||||
|
|
||||||
def read_current_models() -> list[str]:
|
|
||||||
"""Parse current custom_providers entries for our provider name."""
|
|
||||||
if not CONFIG_PATH.exists():
|
|
||||||
return []
|
|
||||||
|
|
||||||
models = []
|
|
||||||
with open(CONFIG_PATH) as f:
|
|
||||||
content = f.read()
|
|
||||||
|
|
||||||
idx = content.find("custom_providers:")
|
|
||||||
if idx == -1:
|
|
||||||
return []
|
|
||||||
|
|
||||||
section = content[idx:]
|
|
||||||
lines = section.split("\n")
|
|
||||||
|
|
||||||
current_entry = {}
|
|
||||||
for line in lines:
|
|
||||||
s = line.strip()
|
|
||||||
if s.startswith("- base_url:"):
|
|
||||||
if current_entry.get("name") == PROVIDER_NAME:
|
|
||||||
m = current_entry.get("model", "")
|
|
||||||
if m:
|
|
||||||
models.append(m)
|
|
||||||
current_entry = {}
|
|
||||||
elif s.startswith("model:"):
|
|
||||||
current_entry["model"] = s.split("model:", 1)[1].strip().strip("'\"")
|
|
||||||
elif s.startswith("name:"):
|
|
||||||
current_entry["name"] = s.split("name:", 1)[1].strip().strip("'\"")
|
|
||||||
elif s and not s.startswith(("-", " ")):
|
|
||||||
break
|
|
||||||
|
|
||||||
# Don't forget the last entry
|
|
||||||
if current_entry.get("name") == PROVIDER_NAME:
|
|
||||||
m = current_entry.get("model", "")
|
|
||||||
if m:
|
|
||||||
models.append(m)
|
|
||||||
|
|
||||||
return sorted(models)
|
|
||||||
|
|
||||||
|
|
||||||
def generate_block(models: list[str]) -> str:
|
|
||||||
lines = ["custom_providers:"]
|
|
||||||
for m in models:
|
|
||||||
lines.append(f"- base_url: {ROUTER_BASE_URL}")
|
|
||||||
lines.append(f" model: {m}")
|
|
||||||
lines.append(f" name: {PROVIDER_NAME}")
|
|
||||||
return "\n".join(lines)
|
|
||||||
|
|
||||||
|
|
||||||
def replace_section(models: list[str]) -> bool:
    """Replace the custom_providers section in-place. Returns True if changed."""
    if not CONFIG_PATH.exists():
        return False

    import yaml

    content = CONFIG_PATH.read_text()
    config = yaml.safe_load(content)

    new_entries = [
        {"base_url": ROUTER_BASE_URL, "model": m, "name": PROVIDER_NAME}
        for m in models
    ]

    if config.get("custom_providers") == new_entries:
        return False

    config["custom_providers"] = new_entries
    CONFIG_PATH.write_text(yaml.dump(config, default_flow_style=False, sort_keys=False))
    return True
|
|
||||||
|
|
||||||
def restart_gateway() -> bool:
|
|
||||||
try:
|
|
||||||
r = subprocess.run(
|
|
||||||
["systemctl", "--user", "restart", GATEWAY_SERVICE],
|
|
||||||
capture_output=True, text=True, timeout=30,
|
|
||||||
)
|
|
||||||
return r.returncode == 0
|
|
||||||
except Exception:
|
|
||||||
return False
|
|
||||||
|
|
||||||
|
|
||||||
def main():
|
|
||||||
models = fetch_models()
|
|
||||||
if models is None:
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
current = read_current_models()
|
|
||||||
if current == models:
|
|
||||||
print("Model list unchanged — nothing to do.")
|
|
||||||
return
|
|
||||||
|
|
||||||
added = set(models) - set(current)
|
|
||||||
removed = set(current) - set(models)
|
|
||||||
print(f"Model list changed! {len(current)} → {len(models)} models")
|
|
||||||
if added:
|
|
||||||
print(f" Added: {sorted(added)}")
|
|
||||||
if removed:
|
|
||||||
print(f" Removed: {sorted(removed)}")
|
|
||||||
|
|
||||||
if not replace_section(models):
|
|
||||||
print("ERROR: Config update failed")
|
|
||||||
return
|
|
||||||
|
|
||||||
print("Config updated. Restarting gateway...")
|
|
||||||
if restart_gateway():
|
|
||||||
print("Gateway restarted successfully.")
|
|
||||||
else:
|
|
||||||
print("WARNING: Gateway restart failed — restart manually.")
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
133
sidecar/app.py
@@ -5,6 +5,7 @@ Runs on the Main PC, manages llama-server subprocess, serves manifest/profile da
|
|||||||
import os
|
import os
|
||||||
import asyncio
|
import asyncio
|
||||||
import signal as signal_module
|
import signal as signal_module
|
||||||
|
import threading
|
||||||
from contextlib import asynccontextmanager
|
from contextlib import asynccontextmanager
|
||||||
from typing import Optional
|
from typing import Optional
|
||||||
|
|
||||||
@@ -17,98 +18,41 @@ from sidecar.manifest import load_manifest
|
|||||||
# Configuration from environment
|
# Configuration from environment
|
||||||
MANIFEST_PATH = os.getenv("MANIFEST_PATH", "/home/bigt/AI/llm/manifest.yaml")
|
MANIFEST_PATH = os.getenv("MANIFEST_PATH", "/home/bigt/AI/llm/manifest.yaml")
|
||||||
SIDECAR_PORT = int(os.getenv("SIDECAR_PORT", "8080"))
|
SIDECAR_PORT = int(os.getenv("SIDECAR_PORT", "8080"))
|
||||||
LLAMA_SERVER_PORT = 8081
|
LLAMA_SERVER_PORT = 8080
|
||||||
LLAMA_STDERR_LOG = os.path.join(
|
|
||||||
os.path.dirname(MANIFEST_PATH), "llama-server-stderr.log"
|
|
||||||
)
|
|
||||||
|
|
||||||
# Global state
|
# Global state
|
||||||
_llama_server_process: Optional[asyncio.subprocess.Process] = None
|
_llama_server_process: Optional[asyncio.subprocess.Process] = None
|
||||||
_active_profile: Optional[str] = None
|
_active_profile: Optional[str] = None
|
||||||
_switch_lock = asyncio.Lock() # Use asyncio.Lock to avoid blocking the event loop
|
_switch_lock = threading.Lock() # Use threading.Lock for compatibility with TestClient
|
||||||
|
|
||||||
|
|
||||||
@asynccontextmanager
|
@asynccontextmanager
|
||||||
async def lifespan(app: FastAPI):
|
async def lifespan(app: FastAPI):
|
||||||
"""Manage sidecar lifecycle — no default model loaded."""
|
"""Manage sidecar lifecycle — no default model loaded."""
|
||||||
print(f"Sidecar starting, manifest={MANIFEST_PATH}, port={SIDECAR_PORT}", flush=True)
|
print(f"Sidecar starting, manifest={MANIFEST_PATH}, port={SIDECAR_PORT}")
|
||||||
yield
|
yield
|
||||||
# Cleanup: kill llama-server if running
|
# Cleanup: kill llama-server if running
|
||||||
global _llama_server_process
|
global _llama_server_process
|
||||||
if _llama_server_process:
|
if _llama_server_process:
|
||||||
await _kill_llama_server()
|
_kill_llama_server()
|
||||||
|
|
||||||
|
|
||||||
app = FastAPI(lifespan=lifespan)
|
app = FastAPI(lifespan=lifespan)
|
||||||
|
|
||||||
|
|
||||||
def _close_stderr_log():
|
def _kill_llama_server():
|
||||||
"""Close the stderr log file handle if it's still attached to the process."""
|
"""Kill the llama-server subprocess."""
|
||||||
global _llama_server_process
|
global _llama_server_process
|
||||||
if _llama_server_process is not None:
|
if _llama_server_process and _llama_server_process.returncode is None:
|
||||||
fh = getattr(_llama_server_process, "_stderr_fh", None)
|
|
||||||
if fh is not None and not fh.closed:
|
|
||||||
try:
|
|
||||||
fh.close()
|
|
||||||
except Exception:
|
|
||||||
pass
|
|
||||||
|
|
||||||
|
|
||||||
async def _kill_llama_server():
|
|
||||||
"""Kill the llama-server subprocess and wait for it to fully terminate.
|
|
||||||
|
|
||||||
This MUST be async because process.wait() is a coroutine. The synchronous
|
|
||||||
version was calling .wait() without await, creating an unawaited coroutine
|
|
||||||
object — the old process was never actually waited on, so it could still
|
|
||||||
hold GPU VRAM when the new server started.
|
|
||||||
"""
|
|
||||||
global _llama_server_process
|
|
||||||
if _llama_server_process is None or _llama_server_process.returncode is not None:
|
|
||||||
_close_stderr_log()
|
|
||||||
return
|
|
||||||
|
|
||||||
try:
|
|
||||||
_llama_server_process.send_signal(signal_module.SIGTERM)
|
|
||||||
try:
|
try:
|
||||||
await asyncio.wait_for(_llama_server_process.wait(), timeout=10)
|
_llama_server_process.send_signal(signal_module.SIGTERM)
|
||||||
except asyncio.TimeoutError:
|
|
||||||
_llama_server_process.kill()
|
|
||||||
try:
|
try:
|
||||||
await asyncio.wait_for(_llama_server_process.wait(), timeout=5)
|
_llama_server_process.wait(timeout=5)
|
||||||
except asyncio.TimeoutError:
|
except asyncio.TimeoutError:
|
||||||
pass
|
_llama_server_process.kill()
|
||||||
except Exception:
|
except Exception:
|
||||||
pass
|
pass
|
||||||
finally:
|
|
||||||
_llama_server_process = None
|
_llama_server_process = None
|
||||||
_close_stderr_log()
|
|
||||||
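The async `_kill_llama_server` docstring explains that `process.wait()` on an asyncio subprocess is a coroutine and has to be awaited, otherwise the old process is never actually waited on. A self-contained sketch of that terminate-then-kill pattern follows. The names `terminate_process` and `demo` are hypothetical, and a sleeping Python child stands in for llama-server.

```python
import asyncio
import sys
from typing import Optional


async def terminate_process(
    proc: asyncio.subprocess.Process,
    term_timeout: float = 10.0,
    kill_timeout: float = 5.0,
) -> Optional[int]:
    """Ask the process to exit, escalate to kill on timeout, return its exit code."""
    if proc.returncode is not None:
        return proc.returncode  # already exited
    proc.terminate()  # SIGTERM on POSIX
    try:
        # wait() is a coroutine: it only takes effect when awaited.
        await asyncio.wait_for(proc.wait(), timeout=term_timeout)
    except asyncio.TimeoutError:
        proc.kill()
        try:
            await asyncio.wait_for(proc.wait(), timeout=kill_timeout)
        except asyncio.TimeoutError:
            pass  # give up waiting; returncode stays None
    return proc.returncode


async def demo() -> Optional[int]:
    # Stand-in child process that would otherwise run for a minute.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(60)",
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.DEVNULL,
    )
    return await terminate_process(proc)
```

Running `asyncio.run(demo())` returns once the child has really exited, which is the property the docstring says the synchronous version lacked.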
|
|
||||||
|
|
||||||
def _flag_value(value) -> str:
    """Convert a manifest flag value to a llama-server CLI argument string.

    YAML booleans (True/False/on/off/yes/no) are parsed as Python bools by
    safe_load. llama-server expects 'on'/'off' for boolean flags, not 'True'/'False'.
    """
    if isinstance(value, bool):
        return "on" if value else "off"
    return str(value)


def _flag_key(key: str) -> str:
    """Convert a manifest flag key to the correct llama-server CLI flag name.

    llama-server uses hyphenated flag names (--ctx-size, --n-gpu-layers),
    but YAML keys often use underscores. Some flags were also renamed
    across llama.cpp versions (e.g. --n-ctx → --ctx-size).

    This function normalises underscores to hyphens and applies known renames.
    """
    normalized = key.replace("_", "-")
    FLAG_RENAMES = {
        "n-ctx": "ctx-size",
    }
    return FLAG_RENAMES.get(normalized, normalized)
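To show what `_flag_key` and `_flag_value` do to a profile's `flags` mapping, here is a small standalone sketch. It restates the two helpers under hypothetical names and adds a hypothetical `build_flag_args` that mirrors the loop in `_start_llama_server`; the sample flags are taken from the plan in this comparison.

```python
# Known renames, copied from the diff (hyphen-normalised key -> CLI flag name).
FLAG_RENAMES = {"n-ctx": "ctx-size"}


def flag_key(key: str) -> str:
    """Normalise underscores to hyphens, then apply known renames."""
    normalized = key.replace("_", "-")
    return FLAG_RENAMES.get(normalized, normalized)


def flag_value(value) -> str:
    """Render booleans as 'on'/'off' and everything else via str()."""
    if isinstance(value, bool):
        return "on" if value else "off"
    return str(value)


def build_flag_args(flags: dict) -> list:
    """Turn a manifest 'flags' mapping into a flat list of CLI arguments."""
    args = []
    for key, value in flags.items():
        args += ["--" + flag_key(key), flag_value(value)]
    return args


example = build_flag_args({"n_ctx": 65536, "flash-attn": True, "n_gpu_layers": 999})
# example == ["--ctx-size", "65536", "--flash-attn", "on", "--n-gpu-layers", "999"]
```

The boolean check has to come before `str()`, since `str(True)` would produce `True` rather than `on`.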
|
|
||||||
|
|
||||||
async def _start_llama_server(profile: dict):
|
async def _start_llama_server(profile: dict):
|
||||||
@@ -116,39 +60,29 @@ async def _start_llama_server(profile: dict):
|
|||||||
global _llama_server_process
|
global _llama_server_process
|
||||||
|
|
||||||
# Kill any existing process
|
# Kill any existing process
|
||||||
await _kill_llama_server()
|
_kill_llama_server()
|
||||||
|
|
||||||
# Build command from profile flags
|
# Build command from profile flags
|
||||||
cmd = ["/home/bigt/AI/llama.cpp/build/bin/llama-server"]
|
cmd = ["llama-server"]
|
||||||
cmd += ["--model", profile["model_path"]]
|
cmd += ["--model", profile["model_path"]]
|
||||||
cmd += ["--port", str(LLAMA_SERVER_PORT)]
|
cmd += ["--port", str(LLAMA_SERVER_PORT)]
|
||||||
cmd += ["--host", "0.0.0.0"]
|
|
||||||
for key, value in profile.get("flags", {}).items():
|
for key, value in profile.get("flags", {}).items():
|
||||||
cmd += ["--" + _flag_key(key), _flag_value(value)]
|
cmd += ["--" + key, str(value)]
|
||||||
|
|
||||||
print(f"Starting llama-server: {' '.join(cmd)}", flush=True)
|
print(f"Starting llama-server: {' '.join(cmd)}")
|
||||||
|
|
||||||
# Capture stderr so we can diagnose crashes (model not found, OOM, bad flag)
|
|
||||||
stderr_fh = open(LLAMA_STDERR_LOG, "w")
|
|
||||||
_llama_server_process = await asyncio.create_subprocess_exec(
|
_llama_server_process = await asyncio.create_subprocess_exec(
|
||||||
*cmd,
|
*cmd,
|
||||||
stdout=asyncio.subprocess.DEVNULL,
|
stdout=asyncio.subprocess.DEVNULL,
|
||||||
stderr=stderr_fh,
|
stderr=asyncio.subprocess.DEVNULL,
|
||||||
)
|
)
|
||||||
# Keep a reference so we can close the handle later
|
|
||||||
_llama_server_process._stderr_fh = stderr_fh # type: ignore[attr-defined]
|
|
||||||
return _llama_server_process
|
return _llama_server_process
|
||||||
|
|
||||||
|
|
||||||
async def _poll_llama_server_ready(max_retries: int = 60, interval: float = 0.5):
|
async def _poll_llama_server_ready(max_retries: int = 240, interval: float = 0.5):
|
||||||
"""Poll llama-server readiness via /v1/models endpoint.
|
"""Poll llama-server readiness via /v1/models endpoint."""
|
||||||
|
|
||||||
Returns True on success. On failure, dumps the captured stderr (if any)
|
|
||||||
so the user can see why llama-server crashed.
|
|
||||||
"""
|
|
||||||
import httpx
|
import httpx
|
||||||
|
|
||||||
for attempt in range(max_retries):
|
for _ in range(max_retries):
|
||||||
try:
|
try:
|
||||||
async with httpx.AsyncClient(timeout=2.0) as client:
|
async with httpx.AsyncClient(timeout=2.0) as client:
|
||||||
resp = await client.get(f"http://localhost:{LLAMA_SERVER_PORT}/v1/models")
|
resp = await client.get(f"http://localhost:{LLAMA_SERVER_PORT}/v1/models")
|
||||||
@@ -157,27 +91,6 @@ async def _poll_llama_server_ready(max_retries: int = 60, interval: float = 0.5)
|
|||||||
except Exception:
|
except Exception:
|
||||||
pass
|
pass
|
||||||
await asyncio.sleep(interval)
|
await asyncio.sleep(interval)
|
||||||
|
|
||||||
# Flush and close the stderr handle so all data is on disk before we read
|
|
||||||
_close_stderr_log()
|
|
||||||
|
|
||||||
# ── Dump stderr for diagnosis ──────────────────────────────────────
|
|
||||||
print("llama-server did NOT become ready — dumping stderr:", flush=True)
|
|
||||||
try:
|
|
||||||
with open(LLAMA_STDERR_LOG) as f:
|
|
||||||
for line in f:
|
|
||||||
print(f" {line.rstrip()}", flush=True)
|
|
||||||
except FileNotFoundError:
|
|
||||||
print(" (stderr log not found — process may not have started)", flush=True)
|
|
||||||
|
|
||||||
# Also log exit code if the process died
|
|
||||||
global _llama_server_process
|
|
||||||
if _llama_server_process and _llama_server_process.returncode is not None:
|
|
||||||
print(
|
|
||||||
f"llama-server exited with code {_llama_server_process.returncode}",
|
|
||||||
flush=True,
|
|
||||||
)
|
|
||||||
|
|
||||||
return False
|
return False
|
||||||
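`_poll_llama_server_ready` retries a readiness check at a fixed interval. The two sides of this diff differ in the default retry count: 60 retries at 0.5 s gives about 30 s of sleep time, and 240 retries gives about 120 s, not counting the time each probe takes. The loop itself can be sketched independently of httpx. `poll_until_ready` and its `probe` parameter are hypothetical names; in the sidecar, the probe would be the GET on `/v1/models`.

```python
import asyncio
from typing import Awaitable, Callable


async def poll_until_ready(
    probe: Callable[[], Awaitable[bool]],
    max_retries: int = 60,
    interval: float = 0.5,
) -> bool:
    """Call probe() until it returns True or the retries run out."""
    for _ in range(max_retries):
        try:
            if await probe():
                return True
        except Exception:
            pass  # treat any probe error as "not ready yet"
        await asyncio.sleep(interval)
    return False
```

Passing the probe in as a parameter keeps the retry policy testable without a running llama-server.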
|
|
||||||
|
|
||||||
@@ -211,7 +124,7 @@ async def switch_model(payload: SwitchRequest):
|
|||||||
"""Stop current llama-server, start new one with the given profile, wait for readiness."""
|
"""Stop current llama-server, start new one with the given profile, wait for readiness."""
|
||||||
global _active_profile
|
global _active_profile
|
||||||
|
|
||||||
async with _switch_lock:
|
with _switch_lock:
|
||||||
# Validate profile_id
|
# Validate profile_id
|
||||||
profiles = load_manifest(MANIFEST_PATH)
|
profiles = load_manifest(MANIFEST_PATH)
|
||||||
if profiles is None:
|
if profiles is None:
|
||||||
@@ -240,7 +153,7 @@ async def switch_model(payload: SwitchRequest):
|
|||||||
}
|
}
|
||||||
|
|
||||||
# Start the new model
|
# Start the new model
|
||||||
await _kill_llama_server()
|
_kill_llama_server()
|
||||||
_active_profile = None
|
_active_profile = None
|
||||||
await _start_llama_server(profile)
|
await _start_llama_server(profile)
|
||||||
|
|
||||||
|
|||||||