root 37fee5341e	fix: capture llama-server stderr, fix YAML boolean flag conversion, reduce polling timeout	2026-06-16 00:06:45 +00:00

Three fixes for the model-not-loading bug:

1. YAML boolean → CLI flag bug: YAML parses 'on'/'off'/'yes'/'no' as Python bools. str(True)='True', which is INVALID for llama.cpp's --flash-attn flag (expects 'on'/'off'/'auto'). Added _flag_value() converter that maps bools to 'on'/'off' strings.
2. llama-server stderr was DEVNULL: All error messages (bad model path, OOM, invalid flag) were invisible. Now captured to /tmp/llama-server-stderr.log and dumped to the sidecar log on failure.
3. Reduce polling timeout: 240 retries × 0.5s = 120s hang. Reduced to 60 retries × 0.5s = 30s. Still dumps stderr + exit code on failure.
4. Manifest VRAM fix: gemma4-26b-compact-long-128k used q8_0 KV cache at 128K context (~24GB on a 24GB RTX 3090, borderline OOM). Changed to q4_0 (~18GB, comfortable).
llm-sidecar.service	fix: unbuffer sidecar stdout so logs appear in journalctl	2026-06-15 16:25:58 +00:00
manifest.yaml	fix: capture llama-server stderr, fix YAML boolean flag conversion, reduce polling timeout	2026-06-16 00:06:45 +00:00
README.md	fix: change sidecar port from 8081 to 8080	2026-06-15 13:17:31 +00:00

README.md

LLM Sidecar — Deployment Guide

Quick Install

On the Main PC:

# 1. Copy the service file
sudo cp deploy/llm-sidecar.service /etc/systemd/system/

# 2. Create the working directory and copy files
mkdir -p /home/bigt/AI/llm
cp deploy/manifest.yaml /home/bigt/AI/llm/manifest.yaml

# Copy the sidecar Python package (app.py + manifest.py)
cp -r sidecar/ /home/bigt/AI/llm/sidecar/

# Copy requirements.txt for the venv
cp requirements.txt /home/bigt/AI/llm/

# 3. Create a Python virtual environment with dependencies
python3 -m venv /home/bigt/AI/llm/venv
/home/bigt/AI/llm/venv/bin/pip install -r /home/bigt/AI/llm/requirements.txt

# 4. Create a .env for the sidecar (optional)
cat > /home/bigt/AI/llm/.env << 'EOF'
# Sidecar configuration
MANIFEST_PATH=/home/bigt/AI/llm/manifest.yaml
SIDECAR_PORT=8080
EOF

# 5. Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable --now llm-sidecar

# 6. Verify it's running
sudo systemctl status llm-sidecar

Verify

# Check sidecar is responding (port must match SIDECAR_PORT, default 8080)
curl http://10.0.4.11:8080/models/available

# Check model status
curl http://10.0.4.11:8080/models/status

# Test the router
curl http://10.0.4.100:9001/v1/models

Configuration

Environment Variables

Variable	Default	Description
`MANIFEST_PATH`	`/home/bigt/AI/llm/manifest.yaml`	Path to the YAML manifest file
`SIDECAR_PORT`	`8080`	Port the sidecar listens on
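
As an illustration of how these two variables and their defaults fit together, here is a minimal sketch of reading them in Python. It is not taken from the sidecar's app.py: SidecarSettings and load_settings are hypothetical names, and the port range check is an added assumption.

```python
import os
from dataclasses import dataclass

# Defaults as documented in the table above.
DEFAULT_MANIFEST_PATH = "/home/bigt/AI/llm/manifest.yaml"
DEFAULT_SIDECAR_PORT = 8080


@dataclass(frozen=True)
class SidecarSettings:
    manifest_path: str
    port: int


def load_settings(env=None) -> SidecarSettings:
    """Read MANIFEST_PATH and SIDECAR_PORT, falling back to the documented defaults."""
    env = os.environ if env is None else env
    raw_port = env.get("SIDECAR_PORT", str(DEFAULT_SIDECAR_PORT))
    try:
        port = int(raw_port)
    except ValueError:
        raise ValueError(f"SIDECAR_PORT must be an integer, got {raw_port!r}") from None
    if not 1 <= port <= 65535:
        raise ValueError(f"SIDECAR_PORT out of range: {port}")
    return SidecarSettings(
        manifest_path=env.get("MANIFEST_PATH", DEFAULT_MANIFEST_PATH),
        port=port,
    )
```

With neither variable set, load_settings() yields the default manifest path and port 8080. Failing early on a non-numeric port gives a clearer error than a bind failure later.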

Manifest Format

- id: model-id
  name: "Display Name"
  model_path: "/path/to/model.gguf"
  flags:          # Arbitrary llama-server flags
    n_ctx: 8192
    n_gpu_layers: 35

id: Unique identifier, used in the model field of chat completion requests
name: Human-readable display name
model_path: Absolute path to the GGUF file
flags: Any llama-server CLI flags (n_ctx, n_gpu_layers, etc.)
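
One pitfall with flags is worth spelling out. YAML 1.1 loaders such as PyYAML parse unquoted on/off/yes/no as booleans, so a value like on reaches Python as True, and str(True) is 'True', which the commit history notes is rejected by --flash-attn (it expects 'on', 'off', or 'auto'). The sketch below shows one way to handle this. It is not the sidecar's actual _flag_value(): flag_value and build_args are hypothetical names, and the mapping from a manifest key to a CLI flag (prefix with --, replace underscores with dashes) is an assumption.

```python
def flag_value(value) -> str:
    """Render a manifest value as a CLI argument string."""
    # Check bool first: YAML turns on/off/yes/no into True/False, and
    # str(True) == "True" is not an accepted value for --flash-attn.
    if isinstance(value, bool):
        return "on" if value else "off"
    return str(value)


def build_args(model_path: str, flags: dict) -> list[str]:
    """Build an argument list from one manifest entry.

    Assumption: each key maps to a flag by prefixing "--" and replacing
    underscores with dashes. The real sidecar may map keys differently.
    """
    args = ["--model", model_path]
    for key, value in flags.items():
        args += ["--" + key.replace("_", "-"), flag_value(value)]
    return args
```

For example, build_args("/path/to/model.gguf", {"flash_attn": True, "n_gpu_layers": 35}) yields --model /path/to/model.gguf --flash-attn on --n-gpu-layers 35. Quoting the value in the manifest (flash_attn: "on") avoids the boolean conversion in the first place.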

Managing the Service

# Start/Stop/Restart
sudo systemctl start llm-sidecar
sudo systemctl stop llm-sidecar
sudo systemctl restart llm-sidecar

# View logs
sudo journalctl -u llm-sidecar -f

# Check status
sudo systemctl status llm-sidecar

# Disable auto-start
sudo systemctl disable llm-sidecar

Troubleshooting

Sidecar not starting: Check the recent logs with sudo journalctl -u llm-sidecar -n 50
Manifest errors: Check that the YAML is valid (python3 -c "import yaml; yaml.safe_load(open('manifest.yaml'))")
llama-server crashes: The sidecar restarts it automatically up to 3 times before the circuit breaker opens. Its stderr is captured to /tmp/llama-server-stderr.log and dumped to the sidecar log on failure.
Port conflict: Change SIDECAR_PORT in the service environment, then restart the service
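
The YAML check under "Manifest errors" only confirms that the file parses. A structural check can also catch missing fields before the sidecar tries to load a model. The following is a minimal sketch based on the fields described under Manifest Format. validate_manifest is a hypothetical helper, and treating id, name, and model_path as required while flags is optional is an assumption, not documented sidecar behavior.

```python
import os


def validate_manifest(entries) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    if not isinstance(entries, list):
        return ["top level must be a list of model entries"]
    problems = []
    seen_ids = set()
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: expected a mapping")
            continue
        # Assumed required string fields.
        for field in ("id", "name", "model_path"):
            if not isinstance(entry.get(field), str) or not entry[field]:
                problems.append(f"entry {i}: missing or empty '{field}'")
        # id is described as a unique identifier.
        model_id = entry.get("id")
        if isinstance(model_id, str) and model_id:
            if model_id in seen_ids:
                problems.append(f"entry {i}: duplicate id '{model_id}'")
            seen_ids.add(model_id)
        # model_path is described as an absolute path.
        path = entry.get("model_path")
        if isinstance(path, str) and path and not os.path.isabs(path):
            problems.append(f"entry {i}: model_path is not absolute")
        if not isinstance(entry.get("flags", {}), dict):
            problems.append(f"entry {i}: 'flags' must be a mapping")
    return problems
```

With PyYAML installed, it can be applied to the parsed file, for example validate_manifest(yaml.safe_load(open("manifest.yaml"))), and each returned string names the offending entry by position.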