intelligence-router/.hermes/plans/add-model-profiles.md

# Plan: Add user model profiles to manifest.yaml
# Date: 2025-06-15
# Author: Hermes Agent
# Status: DRAFT

## Context
User has a collection of GGUF models on their Main PC (10.0.4.11, RTX 3090 24GB VRAM).
The intelligence-router manifest needs profiles for all models with researched llama.cpp parameters.
Research sourced from r/LocalLLaMA, HuggingFace model cards, and community blog posts.

## Hardware constraints
- GPU: RTX 3090, 24GB VRAM
- All profiles use `n_gpu_layers: 999` (requests every layer on the GPU; llama.cpp caps the value at the model's layer count)
- All profiles use `flash_attn: on`
- KV cache quantization (q8_0 or q4_0) to enable 64K+ context within 24GB VRAM
- `min_p` set to 0.0 across all profiles (community standard for these models)
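
The KV cache constraint above can be sanity-checked with a rough estimate. The sketch below assumes a plain dense-attention transformer and uses the llama.cpp block sizes for `q8_0` (34 bytes per 32 values) and `q4_0` (18 bytes per 32 values). The architecture numbers in `ARCH` are hypothetical placeholders, not the real values for any model in this plan; read the actual layer count, KV head count, and head dimension from the GGUF metadata before relying on the result.

```python
# Bytes per cached element for each KV cache type.
# q8_0 packs 32 values into 34 bytes, q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, cache_type_k, cache_type_v):
    """Estimate KV cache size in GiB, assuming every layer caches the full context."""
    elements_per_token = n_layers * n_kv_heads * head_dim  # for K, and again for V
    bytes_per_token = elements_per_token * (
        BYTES_PER_ELEMENT[cache_type_k] + BYTES_PER_ELEMENT[cache_type_v]
    )
    return n_ctx * bytes_per_token / 2**30


# HYPOTHETICAL architecture, for illustration only.
ARCH = {"n_layers": 48, "n_kv_heads": 8, "head_dim": 128}

if __name__ == "__main__":
    for n_ctx, cache_type in [(65536, "f16"), (65536, "q8_0"), (131072, "q4_0")]:
        size = kv_cache_gib(n_ctx, cache_type_k=cache_type, cache_type_v=cache_type, **ARCH)
        print(f"n_ctx={n_ctx:>6} cache={cache_type:<4} ~{size:.3f} GiB")
```

With these placeholder values, an unquantized f16 cache at 64K comes to 12 GiB, while `q8_0` at 64K and `q4_0` at 128K both stay under 7 GiB. Models with sliding-window or hybrid attention layers need less than this estimate, so treat it as an upper bound.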

## Models to add (excluding mmproj files)

### Qwen3.6-27B (1 file: Qwen3.6-27B-Q4_K_M.gguf, ~10.5 GB)
Recommended sampling per HF model card and Unsloth: temp 0.6 / top_p 0.95 / top_k 20

| # | Profile ID | Name | n_ctx | cache_k/v | temp | top_k | repeat_pen |
|---|-----------|------|-------|-----------|------|-------|------------|
| 1 | qwen36-27b-balanced-64k | Qwen3.6-27B Balanced 64K | 65536 | q8_0/q8_0 | 0.6 | 20 | 1.0 |
| 2 | qwen36-27b-thinking-64k | Qwen3.6-27B Thinking 64K | 65536 | q8_0/q8_0 | 1.0 | 20 | 1.0 |
| 3 | qwen36-27b-extended-128k | Qwen3.6-27B Extended 128K | 131072 | q4_0/q4_0 | 0.6 | 20 | 1.05 |
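
The three Qwen3.6-27B rows differ from one another in only a few fields, so they can be expressed as one base plus per-profile overrides. The sketch below is one possible way to generate them. The dictionary layout is an assumption made for illustration and may not match the real `manifest.yaml` schema; the values are taken from the table and the hardware constraints above.

```python
# Defaults shared by every profile in this plan (see "Hardware constraints").
COMMON = {"n_gpu_layers": 999, "flash_attn": True, "min_p": 0.0}

# 64K baseline with the recommended Qwen3.6-27B sampling parameters.
QWEN36_27B_BASE = {
    **COMMON,
    "n_ctx": 65536,
    "cache_type_k": "q8_0",
    "cache_type_v": "q8_0",
    "temp": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "repeat_penalty": 1.0,
}

# Only the fields that differ from the base, keyed by profile ID.
OVERRIDES = {
    "qwen36-27b-balanced-64k": {},
    "qwen36-27b-thinking-64k": {"temp": 1.0},
    "qwen36-27b-extended-128k": {
        "n_ctx": 131072,
        "cache_type_k": "q4_0",
        "cache_type_v": "q4_0",
        "repeat_penalty": 1.05,
    },
}


def build_profiles(base, overrides):
    """Return {profile_id: flags}, where each profile is the base plus its overrides."""
    return {pid: {**base, **delta} for pid, delta in overrides.items()}


PROFILES = build_profiles(QWEN36_27B_BASE, OVERRIDES)
```

Keeping the overrides small makes it easier to spot an unintended difference between profiles during review.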

### Gemma 4 12B (2 files: Q6_K_XL ~8.5 GB, IQ4_XS ~5 GB)
Google official: temp 1.0 / top_p 0.95 / top_k 64

| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k |
|---|-----------|------|------|-------|-----------|------|-------|
| 4 | gemma4-12b-standard-q6-64k | Gemma4 12B Standard Q6 64K | Q6_K_XL | 65536 | q8_0/q8_0 | 1.0 | 64 |
| 5 | gemma4-12b-extended-q6-128k | Gemma4 12B Extended Q6 128K | Q6_K_XL | 131072 | q4_0/q4_0 | 1.0 | 64 |
| 6 | gemma4-12b-compact-iq4-64k | Gemma4 12B Compact IQ4 64K | IQ4_XS | 65536 | q8_0/q8_0 | 1.0 | 64 |
| 7 | gemma4-12b-compact-long-128k | Gemma4 12B Compact IQ4 128K | IQ4_XS | 131072 | q8_0/q8_0 | 1.0 | 64 |

### Gemma 4 26B-A4B (2 files: Q4_K_M ~10.5 GB, IQ4_XS ~6 GB)
MoE, 4B active. Same sampling as 12B family.

| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | repeat_pen |
|---|-----------|------|------|-------|-----------|------|-------|------------|
| 8 | gemma4-26b-balanced-64k | Gemma4 26B Balanced 64K | Q4_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | 1.0 |
| 9 | gemma4-26b-extended-128k | Gemma4 26B Extended 128K | Q4_K_M | 131072 | q4_0/q4_0 | 1.0 | 64 | 1.15 |
| 10 | gemma4-26b-ultra-long-iq4-256k | Gemma4 26B Ultra-Long IQ4 256K | IQ4_XS | 262144 | q4_0/q4_0 | 1.0 | 64 | 1.0 |

### Qwen3.6-35B-A3B (2 files: UD-Q4_K_M ~14 GB, MTP-UD-Q4_K_M ~16 GB)
**MTP note:** Unsloth's benchmark shows MTP is net-negative on a single RTX 3090. The MTP profile is included anyway because the user already has the file.

| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | MTP |
|---|-----------|------|------|-------|-----------|------|-------|-----|
| 11 | qwen36-35b-fast-64k | Qwen3.6-35B Fast 64K | UD-Q4 | 65536 | q8_0/q8_0 | 0.6 | 20 | no |
| 12 | qwen36-35b-thinking-64k | Qwen3.6-35B Thinking 64K | UD-Q4 | 65536 | q8_0/q8_0 | 1.0 | 20 | no |
| 13 | qwen36-35b-extended-128k | Qwen3.6-35B Extended 128K | UD-Q4 | 131072 | q4_0/q4_0 | 0.6 | 20 | no |
| 14 | qwen36-35b-mtp-128k | Qwen3.6-35B MTP 128K | MTP-UD-Q4 | 131072 | q8_0/q8_0 | 0.6 | 20 | yes (n=3) |

### Uncensored models (reuse the parameters of the base model family)

| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | Based on |
|---|-----------|------|------|-------|-----------|------|-------|----------|
| 15 | qwen36-35b-hauhau-aggressive-64k | Qwen3.6-35B HauhauCS Aggressive 64K | Uncensored-HauhauCS-Q4_K_P | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
| 16 | qwen36-35b-genesis-apex-64k | Qwen3.6-35B Genesis APEX 64K | Uncensored-Genesis-APEX | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
| 17 | qwen36-35b-genesis-mtp-apex-128k | Qwen3.6-35B Genesis MTP APEX 128K | Uncensored-Genesis-MTP-APEX | 131072 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B MTP |
| 18 | gemma4-26b-hauhau-balanced-64k | Gemma4 26B HauhauCS Balanced 64K | Uncensored-HauhauCS-Q5_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | Gemma4 26B balanced |

**Total: 18 profiles**
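
A quick check that the profile IDs are unique and add up to the stated total might look like the sketch below. The ID list is copied from the tables above. The helper names and the rule for deriving a family from the first two hyphen-separated tokens of an ID are assumptions for illustration, not part of the router.

```python
from collections import Counter

PROFILE_IDS = [
    "qwen36-27b-balanced-64k", "qwen36-27b-thinking-64k", "qwen36-27b-extended-128k",
    "gemma4-12b-standard-q6-64k", "gemma4-12b-extended-q6-128k",
    "gemma4-12b-compact-iq4-64k", "gemma4-12b-compact-long-128k",
    "gemma4-26b-balanced-64k", "gemma4-26b-extended-128k", "gemma4-26b-ultra-long-iq4-256k",
    "qwen36-35b-fast-64k", "qwen36-35b-thinking-64k",
    "qwen36-35b-extended-128k", "qwen36-35b-mtp-128k",
    "qwen36-35b-hauhau-aggressive-64k", "qwen36-35b-genesis-apex-64k",
    "qwen36-35b-genesis-mtp-apex-128k", "gemma4-26b-hauhau-balanced-64k",
]


def check_profile_ids(ids, expected_total):
    """Raise ValueError if any ID repeats or the count differs from expected_total."""
    duplicates = sorted(pid for pid, n in Counter(ids).items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate profile IDs: {duplicates}")
    if len(ids) != expected_total:
        raise ValueError(f"expected {expected_total} profiles, found {len(ids)}")


def family_counts(ids):
    """Group IDs by their first two tokens, e.g. 'qwen36-27b' (assumed convention)."""
    return Counter("-".join(pid.split("-")[:2]) for pid in ids)


check_profile_ids(PROFILE_IDS, expected_total=18)
```

Grouped this way, the uncensored variants fall under their base families, which lines up with tracking the work as four per-family issues.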

## Flag mapping (manifest → llama-server CLI)

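Manifest flags use snake_case keys. The sidecar passes each one to llama-server as `--flag value`, using the CLI flag listed below (usually the key with underscores replaced by hyphens; `n_ctx` is the exception and maps to `--ctx-size`):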

| Manifest key | CLI flag | Type | Notes |
|-------------|----------|------|-------|
| n_gpu_layers | --n-gpu-layers | int | 999 = all |
| n_ctx | --ctx-size | int | context window |
| cache_type_k | --cache-type-k | str | q8_0, q4_0 |
| cache_type_v | --cache-type-v | str | q8_0, q4_0 |
| flash_attn | --flash-attn | bool | true/on |
| temp | --temp | float | sampling |
| top_p | --top-p | float | sampling |
| top_k | --top-k | int | sampling |
| repeat_penalty | --repeat-penalty | float | sampling |
| min_p | --min-p | float | 0.0 |
| spec_type | --spec-type | str | draft-mtp (only MTP profiles) |
| spec_draft_n_max | --spec-draft-n-max | int | 3 (only MTP profiles) |
| presence_penalty | --presence-penalty | float | 0.0 |
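
The mapping in the table can be sketched as a lookup plus a small conversion function. This is a minimal illustration, not the sidecar's actual code: the function name `to_cli_args` is hypothetical, and rendering booleans as `on`/`off` is an assumption based on the `on` spelling used in this plan.

```python
# Manifest key -> llama-server CLI flag, copied from the table above.
FLAG_MAP = {
    "n_gpu_layers": "--n-gpu-layers",
    "n_ctx": "--ctx-size",
    "cache_type_k": "--cache-type-k",
    "cache_type_v": "--cache-type-v",
    "flash_attn": "--flash-attn",
    "temp": "--temp",
    "top_p": "--top-p",
    "top_k": "--top-k",
    "repeat_penalty": "--repeat-penalty",
    "min_p": "--min-p",
    "spec_type": "--spec-type",
    "spec_draft_n_max": "--spec-draft-n-max",
    "presence_penalty": "--presence-penalty",
}


def to_cli_args(flags):
    """Convert a manifest flag dict into a flat argv-style list of strings."""
    args = []
    for key, value in flags.items():
        if key not in FLAG_MAP:
            raise ValueError(f"unknown manifest key: {key}")
        if isinstance(value, bool):
            # ASSUMPTION: booleans are passed as on/off.
            value = "on" if value else "off"
        args.extend([FLAG_MAP[key], str(value)])
    return args


if __name__ == "__main__":
    example = {"n_gpu_layers": 999, "n_ctx": 65536, "flash_attn": True, "temp": 0.6}
    print(" ".join(to_cli_args(example)))
```

Rejecting unknown keys up front means a typo in the manifest fails at load time instead of being silently dropped or forwarded to llama-server.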

## Actions
1. Create branch `feature/add-model-profiles` from master
2. Create issues on Gitea for each model family (4 issues: qwen27, gemma12b, gemma26b, qwen35b)
3. Update `deploy/manifest.yaml` with all 18 profiles
4. Update tests if flag structure requires it
5. Run tests, commit