
LLM Sidecar — Deployment Guide

Quick Install

On the Main PC:

# 1. Copy the service file
sudo cp deploy/llm-sidecar.service /etc/systemd/system/

# 2. Copy the manifest (adjust paths as needed)
mkdir -p /home/bigt/AI/llm
cp deploy/manifest.yaml /home/bigt/AI/llm/manifest.yaml

# 3. Create a .env for the sidecar (optional)
cat > /home/bigt/AI/llm/.env << 'EOF'
# Sidecar configuration
MANIFEST_PATH=/home/bigt/AI/llm/manifest.yaml
SIDECAR_PORT=8081
EOF

# 4. Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable --now llm-sidecar

# 5. Verify it's running
sudo systemctl status llm-sidecar

Verify

# Check sidecar is responding
curl http://10.0.4.11:8081/models/available

# Check model status
curl http://10.0.4.11:8081/models/status

# Test the router
curl http://10.0.4.100:9001/v1/models
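
The three checks above can also be scripted. The following is a minimal sketch, not part of the sidecar itself: the `ENDPOINTS` labels and the helper names `http_status` and `check_endpoints` are made up for illustration, and it assumes that an HTTP 200 response is enough to call an endpoint healthy (the response bodies are not inspected).

```python
import urllib.error
import urllib.request

# URLs taken from the curl commands above; the labels are illustrative only.
ENDPOINTS = {
    "sidecar available": "http://10.0.4.11:8081/models/available",
    "sidecar status": "http://10.0.4.11:8081/models/status",
    "router models": "http://10.0.4.100:9001/v1/models",
}


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return exc.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return None


def check_endpoints(endpoints, probe=http_status):
    """Map each endpoint label to True when the probe reports HTTP 200."""
    return {name: probe(url) == 200 for name, url in endpoints.items()}
```

Calling `check_endpoints(ENDPOINTS)` from the Main PC returns a dictionary of label to pass/fail. The `probe` parameter exists so the check can be exercised without a network.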

Configuration

Environment Variables

| Variable | Default | Description |
|---|---|---|
| MANIFEST_PATH | /home/bigt/AI/llm/manifest.yaml | Path to the YAML manifest file |
| SIDECAR_PORT | 8081 | Port the sidecar listens on |
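
To illustrate how these two variables and their defaults fit together, here is a minimal sketch of reading them in Python. It is an assumption about how a service like this might load its settings, not the sidecar's actual code; `SidecarConfig` and `load_config` are hypothetical names.

```python
import os
from dataclasses import dataclass

# Defaults as documented in the table above.
DEFAULT_MANIFEST_PATH = "/home/bigt/AI/llm/manifest.yaml"
DEFAULT_SIDECAR_PORT = 8081


@dataclass(frozen=True)
class SidecarConfig:
    manifest_path: str
    port: int


def load_config(env=None):
    """Build a SidecarConfig from environment variables, applying defaults."""
    env = os.environ if env is None else env
    raw_port = env.get("SIDECAR_PORT", str(DEFAULT_SIDECAR_PORT))
    try:
        port = int(raw_port)
    except ValueError:
        raise ValueError(f"SIDECAR_PORT must be an integer, got {raw_port!r}") from None
    if not 1 <= port <= 65535:
        raise ValueError(f"SIDECAR_PORT out of range: {port}")
    return SidecarConfig(
        manifest_path=env.get("MANIFEST_PATH", DEFAULT_MANIFEST_PATH),
        port=port,
    )
```

Passing a plain dictionary instead of `os.environ` makes it easy to try different settings, for example `load_config({"SIDECAR_PORT": "9090"})`.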

Manifest Format

- id: model-id
  name: "Display Name"
  model_path: "/path/to/model.gguf"
  flags:          # Arbitrary llama-server flags
    n_ctx: 8192
    n_gpu_layers: 35
  • id: Unique identifier; clients pass it as the model field in chat completion requests
  • name: Human-readable display name
  • model_path: Absolute path to the GGUF model file
  • flags: Arbitrary llama-server flags as key/value pairs (n_ctx, n_gpu_layers, etc.)
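
Beyond checking that the YAML parses, the field rules above can be checked structurally. The sketch below is one possible validator, written under several assumptions that the sidecar itself may not share: that the manifest root is a list of entries (as in the example), that id, name and model_path are required, that ids must be unique, and that flags is optional. `validate_manifest` is a hypothetical helper, not part of the sidecar.

```python
import posixpath

# Assumed to be required; the sidecar's real rules may differ.
REQUIRED_FIELDS = ("id", "name", "model_path")


def validate_manifest(entries):
    """Return a list of human-readable problems; an empty list means no issues found."""
    if not isinstance(entries, list):
        return ["manifest root must be a list of model entries"]

    errors = []
    seen_ids = set()
    for index, entry in enumerate(entries):
        label = f"entry {index}"
        if not isinstance(entry, dict):
            errors.append(f"{label}: expected a mapping")
            continue

        # Each required field must be a non-empty string.
        for field in REQUIRED_FIELDS:
            value = entry.get(field)
            if not isinstance(value, str) or not value:
                errors.append(f"{label}: missing or empty '{field}'")

        model_id = entry.get("id")
        if isinstance(model_id, str) and model_id:
            if model_id in seen_ids:
                errors.append(f"{label}: duplicate id '{model_id}'")
            seen_ids.add(model_id)

        path = entry.get("model_path")
        if isinstance(path, str) and path and not posixpath.isabs(path):
            errors.append(f"{label}: model_path must be absolute, got '{path}'")

        # flags may be omitted, but when present it must be a mapping.
        flags = entry.get("flags")
        if flags is not None and not isinstance(flags, dict):
            errors.append(f"{label}: flags must be a mapping of flag name to value")
    return errors
```

With PyYAML installed, the parsed manifest can be passed in directly, for example `validate_manifest(yaml.safe_load(open("manifest.yaml")))`.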

Managing the Service

# Start/Stop/Restart
sudo systemctl start llm-sidecar
sudo systemctl stop llm-sidecar
sudo systemctl restart llm-sidecar

# View logs
sudo journalctl -u llm-sidecar -f

# Check status
sudo systemctl status llm-sidecar

# Disable auto-start
sudo systemctl disable llm-sidecar

Troubleshooting

  • Sidecar not starting: Check the most recent log lines with sudo journalctl -u llm-sidecar -n 50
  • Manifest errors: Check that the YAML is valid by running, from the manifest's directory, python3 -c "import yaml; yaml.safe_load(open('manifest.yaml'))"
  • llama-server crashes: The sidecar restarts it automatically up to 3 times; after that the circuit breaker opens
  • Port conflict: Change SIDECAR_PORT in the service environment, then restart the service
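
The restart limit mentioned for llama-server crashes can be pictured as a small counter-based circuit breaker. This is a simplified sketch of the general pattern, assuming a fixed budget of 3 restarts and a manual reset; the sidecar's actual implementation, including when or whether it resets the counter, is not described here. `RestartBreaker` is a hypothetical name.

```python
class RestartBreaker:
    """Allow a limited number of automatic restarts, then open the breaker."""

    def __init__(self, max_restarts=3):
        self.max_restarts = max_restarts
        self.restarts = 0
        self.open = False

    def record_crash(self):
        """Register a crash; return True if a restart should be attempted."""
        if self.restarts < self.max_restarts:
            self.restarts += 1
            return True
        # Restart budget exhausted: open the breaker and stop restarting.
        self.open = True
        return False

    def reset(self):
        """Close the breaker again, for example after an operator intervenes."""
        self.restarts = 0
        self.open = False
```

In this sketch the first three crashes each trigger a restart, and the fourth opens the breaker, at which point the logs from journalctl are the place to look.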