Four fixes for the model-not-loading bug:
1. **YAML boolean → CLI flag bug**: YAML parses unquoted on/off/yes/no as
Python bools, and str(True) == 'True' is not a valid value for llama.cpp's
--flash-attn flag (expects 'on'/'off'/'auto'). Added a _flag_value()
converter that maps bools to 'on'/'off' strings.
2. **llama-server stderr was DEVNULL**: All error messages (bad model path,
OOM, invalid flag) were invisible. Now captured to /tmp/llama-server-stderr.log
and dumped to the sidecar log on failure.
3. **Polling timeout too long**: 240 retries × 0.5s meant a 120s hang when
the server never came up. Reduced to 60 retries × 0.5s = 30s. Stderr and
exit code are still dumped on failure.
4. **Manifest VRAM fix**: gemma4-26b-compact-long-128k used q8_0 KV cache at
128K context (~24GB on a 24GB RTX 3090, borderline OOM). Changed to q4_0
(~18GB, comfortable).
- Qwen3.6-27B: 3 profiles (balanced/thinking/extended)
- Gemma 4 12B: 4 profiles (Q6_K_XL and IQ4_XS variants)
- Gemma 4 26B-A4B: 3 profiles (Q4_K_M and IQ4_XS)
- Qwen3.6-35B-A3B: 3 profiles (fast/thinking/extended, non-MTP)
- Uncensored: 3 profiles (HauhauCS, Genesis APEX)
- Add pytest.ini for test discovery
- All profiles use KV cache quantization (q8_0/q4_0) for 64K-128K context
- Embedded sampling parameters per model family
- Based on research from r/LocalLLaMA, Unsloth benchmarks, HF model cards
Issue #4: Automatic model detection and switch
- Router extracts model from chat body, queries sidecar, triggers switch on mismatch
- Matching active model routes directly to Main PC
- No active model triggers cold start switch
- Tests: 4 in test_router_model_detection.py
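
The routing decision described for Issue #4 can be sketched roughly as follows. The function names and the action labels are hypothetical, and the sketch assumes an OpenAI-style JSON body with a top-level "model" field; the sidecar query is left out and its result passed in as `active`.

```python
import json


def extract_model(body: bytes):
    """Return the 'model' field of a chat request body, or None if absent."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return None
    if not isinstance(payload, dict):
        return None
    model = payload.get("model")
    return model if isinstance(model, str) and model else None


def decide_action(requested, active):
    """Pick a router action from the requested and currently active model."""
    if active is None:
        return "cold_start"  # nothing loaded yet: trigger a cold start switch
    if requested == active:
        return "route"       # match: forward directly to the Main PC
    return "switch"          # mismatch: ask the sidecar to switch first
```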
Issue #5: SSE switch progress feedback
- _sse_format() correctly serializes SSE events
- sse_progress_stream() generates phase progression events
- Proxy yields SSE events then actual response
- Tests: 3 in test_router_sse_progress.py
Issue #6: Circuit breaker + OpenRouter fallback
- Circuit tracks Sidecar failures, opens after MAX_RECOVERY_ATTEMPTS (3)
- OpenRouter API key from env, no longer uses x-intelligence-level header
- Fixes: OPENROUTER_BASE, SSE format, circuit state isolation
- Tests: 7 in test_router_circuit_breaker.py
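
A simplified sketch of the circuit breaker for Issue #6, keeping state per instance rather than in module globals (which is one way to get the state isolation mentioned above). The class and method names are hypothetical; resetting on success is an assumption, and there is no half-open state here.

```python
MAX_RECOVERY_ATTEMPTS = 3  # value taken from this change


class CircuitBreaker:
    """Counts consecutive Sidecar failures and opens at the threshold."""

    def __init__(self, max_failures=MAX_RECOVERY_ATTEMPTS):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self):
        # Open means: stop trying the Sidecar and use the fallback instead.
        return self.failures >= self.max_failures

    def record_failure(self):
        self.failures += 1

    def record_success(self):
        # Assumption: any success closes the circuit again.
        self.failures = 0
```

Tests can then build a fresh CircuitBreaker() per case instead of resetting shared state.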
Issue #7: LXC fallback chain completion
- Full fallback: Main PC → OpenRouter → LXC
- Each backend health-checked via /v1/models before routing
- All backends down → 503 response
- Fixed: execute() wrapped in try/except to trigger fallback chain
- Tests: 3 in test_router_fallback_lxc.py
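
The fallback chain for Issue #7 can be sketched as below. route_with_fallback() and the backend labels are hypothetical; the /v1/models health check and the real execute() are abstracted as callables so the control flow is visible on its own.

```python
BACKENDS = ["main_pc", "openrouter", "lxc"]  # order from this change


def route_with_fallback(backends, is_healthy, execute):
    """Try each backend in order; return (status, body)."""
    for name in backends:
        if not is_healthy(name):  # stands in for the /v1/models check
            continue
        try:
            return 200, execute(name)
        except Exception:
            # A backend can pass the health check and still fail the request;
            # catching here is what lets the chain move on to the next one.
            continue
    return 503, "all backends unavailable"
```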
Issue #8: Systemd service deployment
- deploy/llm-sidecar.service: systemd unit with Restart=always
- deploy/manifest.yaml: example manifest with 3 profiles
- deploy/README.md: deployment instructions
- Updated: docker-compose.yml, requirements.txt, Dockerfile
Test framework improvements:
- tests/conftest.py: shared URL patches for all router tests
- Fixed global state pollution in circuit breaker tests
- Fixed sidecar switch test (uses AsyncMock for the async function)
Total: 42 tests passing