# Plan: Add user model profiles to manifest.yaml
# Date: 2025-06-15
# Author: Hermes Agent
# Status: DRAFT

## Context

The user has a collection of GGUF models on their Main PC (10.0.4.11, RTX 3090 24GB VRAM). The intelligence-router manifest needs profiles for all models with researched llama.cpp parameters. Research sourced from r/LocalLLaMA, HuggingFace model cards, and community blog posts.

## Hardware constraints

- GPU: RTX 3090, 24GB VRAM
- All profiles use `n_gpu_layers: 999` (offload all layers that fit)
- All profiles use `flash_attn: on`
- KV cache quantization (q8_0 or q4_0) to enable 64K+ context within 24GB VRAM
- `min_p` set to 0.0 across all profiles (community standard for these models)

## Models to add (excluding mmproj files)

### Qwen3.6-27B (1 file: Qwen3.6-27B-Q4_K_M.gguf, ~10.5 GB)

Recommended sampling per HF model card and Unsloth: temp 0.6 / top_p 0.95 / top_k 20

| # | Profile ID | Name | n_ctx | cache_k/v | temp | top_k | repeat_pen |
|---|-----------|------|-------|-----------|------|-------|------------|
| 1 | qwen36-27b-balanced-64k | Qwen3.6-27B Balanced 64K | 65536 | q8_0/q8_0 | 0.6 | 20 | 1.0 |
| 2 | qwen36-27b-thinking-64k | Qwen3.6-27B Thinking 64K | 65536 | q8_0/q8_0 | 1.0 | 20 | 1.0 |
| 3 | qwen36-27b-extended-128k | Qwen3.6-27B Extended 128K | 131072 | q4_0/q4_0 | 0.6 | 20 | 1.05 |

### Gemma 4 12B (2 files: Q6_K_XL ~8.5 GB, IQ4_XS ~5 GB)

Google official: temp 1.0 / top_p 0.95 / top_k 64

| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k |
|---|-----------|------|------|-------|-----------|------|-------|
| 4 | gemma4-12b-standard-q6-64k | Gemma4 12B Standard Q6 64K | Q6_K_XL | 65536 | q8_0/q8_0 | 1.0 | 64 |
| 5 | gemma4-12b-extended-q6-128k | Gemma4 12B Extended Q6 128K | Q6_K_XL | 131072 | q4_0/q4_0 | 1.0 | 64 |
| 6 | gemma4-12b-compact-iq4-64k | Gemma4 12B Compact IQ4 64K | IQ4_XS | 65536 | q8_0/q8_0 | 1.0 | 64 |
| 7 | gemma4-12b-compact-long-128k | Gemma4 12B Compact IQ4 128K | IQ4_XS | 131072 | q8_0/q8_0 | 1.0 | 64 |
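
The choice between q8_0 and q4_0 cache types at 64K and 128K in the tables above follows from how the KV cache scales with context length. A minimal sketch of that arithmetic is given here, assuming full attention in every layer (so it is an upper bound for models that use sliding-window or hybrid attention). The architecture numbers in `arch` are hypothetical placeholders, not values from any model card; the real values would come from each GGUF's metadata. The per-element sizes follow the ggml block layouts (q8_0 stores 32 elements in 34 bytes, q4_0 in 18 bytes).

```python
# Bytes per cached element for llama.cpp KV cache types (ggml block formats):
# f16 = 2 bytes; q8_0 = 34 bytes per 32 elements; q4_0 = 18 bytes per 32 elements.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, cache_type_k, cache_type_v):
    """Upper-bound KV cache size in GiB, assuming full attention in every layer."""
    # Elements stored per token for K (and the same count again for V).
    elements_per_token = n_layers * n_kv_heads * head_dim
    bytes_per_token = elements_per_token * (
        BYTES_PER_ELEMENT[cache_type_k] + BYTES_PER_ELEMENT[cache_type_v]
    )
    return n_ctx * bytes_per_token / 2**30


# HYPOTHETICAL architecture, for illustration only (not from any model card).
arch = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for n_ctx, cache in [(65536, "f16"), (65536, "q8_0"), (131072, "q4_0")]:
    size = kv_cache_gib(n_ctx, cache_type_k=cache, cache_type_v=cache, **arch)
    print(f"n_ctx={n_ctx} cache={cache}: {size:.2f} GiB")
```

Adding the result to the model file size gives a rough check of whether a profile fits in 24GB; compute buffers and other runtime overhead are not included in this estimate.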

### Gemma 4 26B-A4B (2 files: Q4_K_M ~10.5 GB, IQ4_XS ~6 GB)

MoE, 4B active. Same sampling as 12B family.

| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | repeat_pen |
|---|-----------|------|------|-------|-----------|------|-------|------------|
| 8 | gemma4-26b-balanced-64k | Gemma4 26B Balanced 64K | Q4_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | 1.0 |
| 9 | gemma4-26b-extended-128k | Gemma4 26B Extended 128K | Q4_K_M | 131072 | q4_0/q4_0 | 1.0 | 64 | 1.15 |
| 10 | gemma4-26b-ultra-long-iq4-256k | Gemma4 26B Ultra-Long IQ4 256K | IQ4_XS | 262144 | q4_0/q4_0 | 1.0 | 64 | 1.0 |

### Qwen3.6-35B-A3B (2 files: UD-Q4_K_M ~14 GB, MTP-UD-Q4_K_M ~16 GB)

**MTP note:** Unsloth benchmark shows MTP is net-negative on a single 3090. An MTP profile is included anyway since the user has the file.

| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | MTP |
|---|-----------|------|------|-------|-----------|------|-------|-----|
| 11 | qwen36-35b-fast-64k | Qwen3.6-35B Fast 64K | UD-Q4 | 65536 | q8_0/q8_0 | 0.6 | 20 | no |
| 12 | qwen36-35b-thinking-64k | Qwen3.6-35B Thinking 64K | UD-Q4 | 65536 | q8_0/q8_0 | 1.0 | 20 | no |
| 13 | qwen36-35b-extended-128k | Qwen3.6-35B Extended 128K | UD-Q4 | 131072 | q4_0/q4_0 | 0.6 | 20 | no |
| 14 | qwen36-35b-mtp-128k | Qwen3.6-35B MTP 128K | MTP-UD-Q4 | 131072 | q8_0/q8_0 | 0.6 | 20 | yes (n=3) |

### Uncensored models (apply the base family's params)

| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | Based on |
|---|-----------|------|------|-------|-----------|------|-------|----------|
| 15 | qwen36-35b-hauhau-aggressive-64k | Qwen3.6-35B HauhauCS Aggressive 64K | Uncensored-HauhauCS-Q4_K_P | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
| 16 | qwen36-35b-genesis-apex-64k | Qwen3.6-35B Genesis APEX 64K | Uncensored-Genesis-APEX | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
| 17 | qwen36-35b-genesis-mtp-apex-128k | Qwen3.6-35B Genesis MTP APEX 128K | Uncensored-Genesis-MTP-APEX | 131072 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B MTP |
| 18 | gemma4-26b-hauhau-balanced-64k | Gemma4 26B HauhauCS Balanced 64K | Uncensored-HauhauCS-Q5_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | Gemma4 26B balanced |

**Total: 18 profiles**

## Flag mapping (manifest → llama-server CLI)

Manifest flags use snake_case keys. The sidecar maps each key to the llama-server flag listed below and passes it as `--flag value`:

| Manifest key | CLI flag | Type | Notes |
|-------------|----------|------|-------|
| n_gpu_layers | --n-gpu-layers | int | 999 = all |
| n_ctx | --ctx-size | int | context window |
| cache_type_k | --cache-type-k | str | q8_0, q4_0 |
| cache_type_v | --cache-type-v | str | q8_0, q4_0 |
| flash_attn | --flash-attn | bool | true/on |
| temp | --temp | float | sampling |
| top_p | --top-p | float | sampling |
| top_k | --top-k | int | sampling |
| repeat_penalty | --repeat-penalty | float | sampling |
| min_p | --min-p | float | 0.0 |
| spec_type | --spec-type | str | draft-mtp (only MTP profiles) |
| spec_draft_n_max | --spec-draft-n-max | int | 3 (only MTP profiles) |
| presence_penalty | --presence-penalty | float | 0.0 |

## Actions

1. Create branch `feature/add-model-profiles` from master
2. Create issues on Gitea for each model family (4 issues: qwen27, gemma12b, gemma26b, qwen35b)
3. Update `deploy/manifest.yaml` with all 18 profiles
4. Update tests if flag structure requires it
5. Run tests, commit
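
For step 4, the manifest-to-CLI translation in the flag mapping table could be exercised with something like the sketch below. It is not the sidecar's actual code: `FLAG_MAP` and `build_args` are hypothetical names, the flag strings are copied verbatim from the table rather than checked against a particular llama-server build, and rendering booleans as `on`/`off` is an assumption based on the table's "true/on" note.

```python
# Manifest key -> llama-server flag, copied from the flag mapping table.
FLAG_MAP = {
    "n_gpu_layers": "--n-gpu-layers",
    "n_ctx": "--ctx-size",
    "cache_type_k": "--cache-type-k",
    "cache_type_v": "--cache-type-v",
    "flash_attn": "--flash-attn",
    "temp": "--temp",
    "top_p": "--top-p",
    "top_k": "--top-k",
    "repeat_penalty": "--repeat-penalty",
    "min_p": "--min-p",
    "spec_type": "--spec-type",
    "spec_draft_n_max": "--spec-draft-n-max",
    "presence_penalty": "--presence-penalty",
}


def build_args(profile: dict) -> list[str]:
    """Translate a profile's manifest flags into a llama-server argument list."""
    args = []
    for key, value in profile.items():
        if key not in FLAG_MAP:
            # Fail loudly on unmapped keys instead of silently dropping them.
            raise KeyError(f"no CLI mapping for manifest key: {key}")
        if isinstance(value, bool):
            # ASSUMPTION: booleans are rendered as on/off.
            value = "on" if value else "off"
        args += [FLAG_MAP[key], str(value)]
    return args


# Flags for profile 1 (qwen36-27b-balanced-64k), taken from the tables in this plan.
profile = {
    "n_gpu_layers": 999,
    "n_ctx": 65536,
    "cache_type_k": "q8_0",
    "cache_type_v": "q8_0",
    "flash_attn": True,
    "temp": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "repeat_penalty": 1.0,
    "min_p": 0.0,
}
print(" ".join(build_args(profile)))
```

Raising on unknown keys is a deliberate choice in this sketch: a typo in `manifest.yaml` then surfaces in tests rather than as a silently ignored parameter.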