root 2c23faa4a1 fix: add probe endpoints and no-model fallback for Hermes Desktop compatibility

Hermes Desktop sends probe requests to validate providers before allowing
model switching. The router was returning 503 for all of these because
the catch-all proxy requires a 'model' field in the request body.

Added explicit handlers for:
- GET /v1/models/{model_id} — OpenAI single-model lookup
- GET /api/tags — Ollama model list discovery
- POST /api/show — Ollama model info
- GET /api/v1/models — Ollama-compatible model list
- GET /v1/props, GET /props — llama.cpp server properties
- GET /version — llama.cpp version

Also fixed the catch-all proxy to route requests with no model body to
the currently active backend instead of returning 503.

2026-06-15 15:22:15 +00:00


Plan: Add user model profiles to manifest.yaml

Date: 2025-06-15

Author: Hermes Agent

Status: DRAFT

Context

The user has a collection of GGUF models on their main PC (10.0.4.11, RTX 3090 with 24 GB VRAM). The intelligence-router manifest needs profiles covering every model, each with researched llama.cpp parameters. The research draws on r/LocalLLaMA, HuggingFace model cards, and community blog posts.

Hardware constraints

GPU: RTX 3090, 24 GB VRAM
All profiles use n_gpu_layers: 999 (request offload of every layer)
All profiles use flash_attn: on
All profiles quantize the KV cache (q8_0 or q4_0) so that 64K+ context fits within 24 GB VRAM
All profiles set min_p to 0.0 (community standard for these models)
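
To make the KV cache constraint concrete, the following is a rough sizing sketch rather than a measurement. It assumes a plain full-attention transformer, where the cache holds one K and one V tensor per layer. The architecture numbers in `ARCH` are hypothetical placeholders and are not taken from any of the model cards. The bytes-per-element figures follow the ggml block layouts (32 values plus one fp16 scale per block).

```python
# Bytes per cached element. Quantized blocks hold 32 values plus one fp16
# scale: q8_0 uses 34 bytes per block, q4_0 uses 18 bytes per block.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, cache_k, cache_v):
    """Estimate KV cache size in GiB for a full-attention transformer."""
    elements_per_tensor = n_ctx * n_layers * n_kv_heads * head_dim
    total_bytes = elements_per_tensor * (
        BYTES_PER_ELEMENT[cache_k] + BYTES_PER_ELEMENT[cache_v]
    )
    return total_bytes / 2**30


# Hypothetical architecture for illustration only (NOT from a model card).
ARCH = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for n_ctx, cache in [(65536, "f16"), (65536, "q8_0"), (131072, "q4_0")]:
    size = kv_cache_gib(n_ctx, cache_k=cache, cache_v=cache, **ARCH)
    print(f"n_ctx={n_ctx:>6} cache={cache:<4} -> {size:.2f} GiB")
```

With these placeholder numbers, a 64K cache takes 12 GiB at f16 and roughly 6.4 GiB at q8_0, while a 128K cache at q4_0 takes about 6.75 GiB. Models that use sliding-window or hybrid attention will need less than this estimate suggests, so real figures should come from the llama-server startup log.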

Models to add (excluding mmproj files)

Qwen3.6-27B (1 file: Qwen3.6-27B-Q4_K_M.gguf, ~10.5 GB)

Recommended sampling per HF model card and Unsloth: temp 0.6 / top_p 0.95 / top_k 20

#	Profile ID	Name	n_ctx	cache_k/v	temp	top_k	repeat_pen
1	qwen36-27b-balanced-64k	Qwen3.6-27B Balanced 64K	65536	q8_0/q8_0	0.6	20	1.0
2	qwen36-27b-thinking-64k	Qwen3.6-27B Thinking 64K	65536	q8_0/q8_0	1.0	20	1.0
3	qwen36-27b-extended-128k	Qwen3.6-27B Extended 128K	131072	q4_0/q4_0	0.6	20	1.05

Gemma 4 12B (2 files: Q6_K_XL ~8.5 GB, IQ4_XS ~5 GB)

Google official: temp 1.0 / top_p 0.95 / top_k 64

#	Profile ID	Name	File	n_ctx	cache_k/v	temp	top_k
4	gemma4-12b-standard-q6-64k	Gemma4 12B Standard Q6 64K	Q6_K_XL	65536	q8_0/q8_0	1.0	64
5	gemma4-12b-extended-q6-128k	Gemma4 12B Extended Q6 128K	Q6_K_XL	131072	q4_0/q4_0	1.0	64
6	gemma4-12b-compact-iq4-64k	Gemma4 12B Compact IQ4 64K	IQ4_XS	65536	q8_0/q8_0	1.0	64
7	gemma4-12b-compact-long-128k	Gemma4 12B Compact IQ4 128K	IQ4_XS	131072	q8_0/q8_0	1.0	64

Gemma 4 26B-A4B (2 files: Q4_K_M ~10.5 GB, IQ4_XS ~6 GB)

MoE, 4B active. Same sampling as 12B family.

#	Profile ID	Name	File	n_ctx	cache_k/v	temp	top_k	repeat_pen
8	gemma4-26b-balanced-64k	Gemma4 26B Balanced 64K	Q4_K_M	65536	q8_0/q8_0	1.0	64	1.0
9	gemma4-26b-extended-128k	Gemma4 26B Extended 128K	Q4_K_M	131072	q4_0/q4_0	1.0	64	1.15
10	gemma4-26b-ultra-long-iq4-256k	Gemma4 26B Ultra-Long IQ4 256K	IQ4_XS	262144	q4_0/q4_0	1.0	64	1.0

Qwen3.6-35B-A3B (2 files: UD-Q4_K_M ~14 GB, MTP-UD-Q4_K_M ~16 GB)

MTP note: an Unsloth benchmark shows MTP is net-negative on a single 3090. An MTP profile is included anyway because the user has the file.

#	Profile ID	Name	File	n_ctx	cache_k/v	temp	top_k	MTP
11	qwen36-35b-fast-64k	Qwen3.6-35B Fast 64K	UD-Q4	65536	q8_0/q8_0	0.6	20	no
12	qwen36-35b-thinking-64k	Qwen3.6-35B Thinking 64K	UD-Q4	65536	q8_0/q8_0	1.0	20	no
13	qwen36-35b-extended-128k	Qwen3.6-35B Extended 128K	UD-Q4	131072	q4_0/q4_0	0.6	20	no
14	qwen36-35b-mtp-128k	Qwen3.6-35B MTP 128K	MTP-UD-Q4	131072	q8_0/q8_0	0.6	20	yes (n=3)

Uncensored models (apply the parameters of the corresponding base family)

#	Profile ID	Name	File	n_ctx	cache_k/v	temp	top_k	Based on
15	qwen36-35b-hauhau-aggressive-64k	Qwen3.6-35B HauhauCS Aggressive 64K	Uncensored-HauhauCS-Q4_K_P	65536	q8_0/q8_0	0.6	20	Qwen3.6-35B fast
16	qwen36-35b-genesis-apex-64k	Qwen3.6-35B Genesis APEX 64K	Uncensored-Genesis-APEX	65536	q8_0/q8_0	0.6	20	Qwen3.6-35B fast
17	qwen36-35b-genesis-mtp-apex-128k	Qwen3.6-35B Genesis MTP APEX 128K	Uncensored-Genesis-MTP-APEX	131072	q8_0/q8_0	0.6	20	Qwen3.6-35B MTP
18	gemma4-26b-hauhau-balanced-64k	Gemma4 26B HauhauCS Balanced 64K	Uncensored-HauhauCS-Q5_K_M	65536	q8_0/q8_0	1.0	64	Gemma4 26B balanced

Total: 18 profiles
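
Since each uncensored profile reuses the parameters of a base profile, one way to avoid copy-paste drift is to derive it programmatically. This is only a sketch: the dictionary layout (`id`, `name`, `file`, `flags`) and the `derive_profile` helper are hypothetical and do not reflect the actual manifest schema.

```python
import copy

# Base profile 11 from the plan, in a hypothetical dict layout.
BASE = {
    "id": "qwen36-35b-fast-64k",
    "name": "Qwen3.6-35B Fast 64K",
    "file": "UD-Q4",
    "flags": {
        "n_ctx": 65536,
        "cache_type_k": "q8_0",
        "cache_type_v": "q8_0",
        "temp": 0.6,
        "top_k": 20,
    },
}


def derive_profile(base, profile_id, name, file):
    """Copy a base profile, replacing only its identity and model file."""
    derived = copy.deepcopy(base)  # deep copy so the flags dict is not shared
    derived.update(id=profile_id, name=name, file=file)
    return derived


# Profile 15 inherits every flag from profile 11.
hauhau = derive_profile(
    BASE,
    "qwen36-35b-hauhau-aggressive-64k",
    "Qwen3.6-35B HauhauCS Aggressive 64K",
    "Uncensored-HauhauCS-Q4_K_P",
)
print(hauhau["id"], hauhau["flags"])
```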

Flag mapping (manifest → llama-server CLI)

Manifest flags use snake_case keys. The sidecar translates each key to the corresponding llama-server CLI flag and passes it as --flag value:

Manifest key	CLI flag	Type	Notes
n_gpu_layers	--n-gpu-layers	int	999 = all
n_ctx	--ctx-size	int	context window
cache_type_k	--cache-type-k	str	q8_0, q4_0
cache_type_v	--cache-type-v	str	q8_0, q4_0
flash_attn	--flash-attn	bool	true/on
temp	--temp	float	sampling
top_p	--top-p	float	sampling
top_k	--top-k	int	sampling
repeat_penalty	--repeat-penalty	float	sampling
min_p	--min-p	float	0.0
spec_type	--spec-type	str	draft-mtp (only MTP profiles)
spec_draft_n_max	--spec-draft-n-max	int	3 (only MTP profiles)
presence_penalty	--presence-penalty	float	0.0
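
The mapping in the table above can be sketched as a simple lookup. This is an illustration, not the sidecar's actual code: the `build_args` helper, the flat `flags` dictionary, and the choice to render booleans as on/off are all assumptions. The flag names are copied from the table.

```python
# Manifest key -> llama-server CLI flag, copied from the table above.
FLAG_MAP = {
    "n_gpu_layers": "--n-gpu-layers",
    "n_ctx": "--ctx-size",
    "cache_type_k": "--cache-type-k",
    "cache_type_v": "--cache-type-v",
    "flash_attn": "--flash-attn",
    "temp": "--temp",
    "top_p": "--top-p",
    "top_k": "--top-k",
    "repeat_penalty": "--repeat-penalty",
    "min_p": "--min-p",
    "spec_type": "--spec-type",
    "spec_draft_n_max": "--spec-draft-n-max",
    "presence_penalty": "--presence-penalty",
}


def build_args(flags):
    """Turn a manifest flags mapping into a flat llama-server argument list."""
    args = []
    for key, value in flags.items():
        if key not in FLAG_MAP:
            raise KeyError(f"unmapped manifest key: {key}")
        if isinstance(value, bool):
            # Assumption: booleans are rendered as on/off.
            value = "on" if value else "off"
        args += [FLAG_MAP[key], str(value)]
    return args


# Flags for profile 1 (qwen36-27b-balanced-64k), values from the plan.
balanced_64k = {
    "n_gpu_layers": 999,
    "n_ctx": 65536,
    "cache_type_k": "q8_0",
    "cache_type_v": "q8_0",
    "flash_attn": True,
    "temp": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "repeat_penalty": 1.0,
    "min_p": 0.0,
}

print(" ".join(build_args(balanced_64k)))
```

Raising on an unmapped key, instead of silently forwarding it, keeps a typo in the manifest from reaching llama-server as an unknown flag.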

Actions

Create branch feature/add-model-profiles from master
Create issues on Gitea for each model family (4 issues: qwen27, gemma12b, gemma26b, qwen35b)
Update deploy/manifest.yaml with all 18 profiles
Update tests if flag structure requires it
Run tests, commit
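
If the tests do need updating, a consistency check along these lines could catch mistakes in the new profiles before deployment. It is a sketch under assumptions: the profile layout (`id` plus a `flags` dict) is hypothetical, and the rule that MTP flags appear only on profiles with "mtp" in their ID is inferred from the naming in this plan.

```python
ALLOWED_CACHE_TYPES = {"q8_0", "q4_0"}
MTP_KEYS = {"spec_type", "spec_draft_n_max"}


def validate_profiles(profiles):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    seen = set()
    for profile in profiles:
        pid = profile["id"]
        flags = profile["flags"]
        if pid in seen:
            errors.append(f"{pid}: duplicate profile id")
        seen.add(pid)
        for key in ("cache_type_k", "cache_type_v"):
            if flags.get(key) not in ALLOWED_CACHE_TYPES:
                errors.append(f"{pid}: {key} must be q8_0 or q4_0")
        present = MTP_KEYS & flags.keys()
        if present and present != MTP_KEYS:
            errors.append(f"{pid}: MTP flags must be set together")
        # Assumption: only profiles with "mtp" in the ID carry MTP flags.
        if present and "mtp" not in pid:
            errors.append(f"{pid}: MTP flags on a non-MTP profile")
    return errors


good = {
    "id": "qwen36-35b-mtp-128k",
    "flags": {
        "cache_type_k": "q8_0",
        "cache_type_v": "q8_0",
        "spec_type": "draft-mtp",
        "spec_draft_n_max": 3,
    },
}
bad = {
    "id": "qwen36-35b-fast-64k",
    "flags": {"cache_type_k": "f16", "cache_type_v": "q8_0", "spec_type": "draft-mtp"},
}

for problem in validate_profiles([good, bad]):
    print(problem)
```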
