root 4914363089 Epic: Model Switching via Sidecar — Issues #4-#7 + #8 deployment

Issue #4: Automatic model detection and switch
- Router extracts model from chat body, queries sidecar, triggers switch on mismatch
- Matching active model routes directly to Main PC
- No active model triggers cold start switch
- Tests: 4 test_router_model_detection.py
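
A minimal sketch of the detection step, assuming the chat body is OpenAI-style JSON with a top-level `model` field and that the sidecar's active model has already been fetched. `extract_model`, `decide_route`, and the three return labels are hypothetical names, not taken from the router's code; treating a request with no model as "direct" is also an assumption.

```python
import json
from typing import Optional


def extract_model(body: bytes) -> Optional[str]:
    """Return the requested model name from an OpenAI-style chat body, if any."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return None
    model = payload.get("model") if isinstance(payload, dict) else None
    return model if isinstance(model, str) and model else None


def decide_route(requested: Optional[str], active: Optional[str]) -> str:
    """Map (requested, active) to one of three hypothetical actions."""
    if requested is None or requested == active:
        return "direct"      # matching active model: route straight to the Main PC
    if active is None:
        return "cold_start"  # nothing loaded yet: switch from cold
    return "switch"          # mismatch: switch before proxying
```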

Issue #5: SSE switch progress feedback
- _sse_format() correctly serializes SSE events
- sse_progress_stream() generates phase progression events
- Proxy yields SSE events then actual response
- Tests: 3 test_router_sse_progress.py
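
For illustration, the SSE wire format and a phase progression could look roughly like this. The function names, the payload fields, and the phase names passed in are assumptions; only the `model_switching` event name comes from the project notes.

```python
import json
from typing import Iterator, List


def sse_format(event: str, data: dict) -> str:
    """Serialize one Server-Sent Event: an event line, a data line, a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def progress_events(model: str, phases: List[str]) -> Iterator[str]:
    """Yield one model_switching event per phase, in order."""
    for step, phase in enumerate(phases, start=1):
        yield sse_format(
            "model_switching",
            {"model": model, "phase": phase, "step": step, "total": len(phases)},
        )
```

A proxy would yield these strings first and then stream the backend's actual response.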

Issue #6: Circuit breaker + OpenRouter fallback
- Circuit tracks Sidecar failures, opens after MAX_RECOVERY_ATTEMPTS (3)
- OpenRouter API key from env, no longer uses x-intelligence-level header
- Fixes: OPENROUTER_BASE, SSE format, circuit state isolation
- Tests: 7 test_router_circuit_breaker.py
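
A sketch of a failure-counting circuit, assuming the circuit opens after three consecutive failures and closes again on any success (the reset rule is an assumption). The class and method names are hypothetical.

```python
from dataclasses import dataclass

MAX_RECOVERY_ATTEMPTS = 3  # value taken from the commit message


@dataclass
class CircuitBreaker:
    """Counts consecutive sidecar failures; opens once the limit is reached."""

    max_failures: int = MAX_RECOVERY_ATTEMPTS
    failures: int = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.max_failures

    def record_failure(self) -> None:
        self.failures += 1

    def record_success(self) -> None:
        # Assumption: any success closes the circuit again.
        self.failures = 0
```

Keeping the counter on an instance rather than in module globals is one way to get the state isolation mentioned above: each test can build a fresh `CircuitBreaker()`.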

Issue #7: LXC fallback chain completion
- Full fallback: Main PC → OpenRouter → LXC
- Each backend health-checked via /v1/models before routing
- All backends down → 503 response
- Fixed: execute() wrapped in try/except to trigger fallback chain
- Tests: 3 test_router_fallback_lxc.py
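
The chain could be sketched as an ordered loop over backends. Here each backend is a tuple of hypothetical async callables standing in for the real `/v1/models` health check and the proxied request; the tuple layout and the 503 body text are assumptions.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

# (name, health_check, execute): hypothetical stand-ins for real HTTP calls.
Backend = Tuple[str, Callable[[], Awaitable[bool]], Callable[[], Awaitable[str]]]


async def route_with_fallback(backends: List[Backend]) -> Tuple[int, str]:
    """Try backends in order; return (status, body), or 503 if all fail."""
    for _name, is_healthy, execute in backends:
        try:
            if not await is_healthy():
                continue
            return 200, await execute()
        except Exception:
            # A failed health check or a failed request moves to the next backend.
            continue
    return 503, "all backends unavailable"
```

Wrapping `execute()` in the same `try` as the health check is what lets a backend that passes its health check but fails the request still hand over to the next level.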

Issue #8: Systemd service deployment
- deploy/llm-sidecar.service: systemd unit with Restart=always
- deploy/manifest.yaml: example manifest with 3 profiles
- deploy/README.md: deployment instructions
- Updated: docker-compose.yml, requirements.txt, Dockerfile

Test framework improvements:
- tests/conftest.py: shared URL patches for all router tests
- Fixed global state pollution in circuit breaker tests
- Fixed sidecar switch test (AsyncMock for async function)

Total: 42 tests passing

2026-06-15 01:13:36 +00:00


Intelligence Router — Context & Glossary

Terminology

| Term | Definition |
|------|------------|
| Router | The FastAPI proxy running in Docker (10.0.4.100:9001). Intercepts LLM requests, checks active model, and routes accordingly. |
| Sidecar | A lightweight Python service running on the Main PC via systemd. Manages the llama-server subprocess and serves manifest/profile data. |
| Profile | A named model configuration from the manifest. Contains a model path, display name, and arbitrary llama-server flags. A single GGUF can have multiple profiles. |
| Manifest | A YAML file on the Main PC (`/home/bigt/AI/llm/manifest.yaml`) that lists all available profiles. Source of truth for what models Hermes sees. |
| Model Switch | The destructive handoff process: stop current llama-server, start new one with chosen profile's flags, wait for readiness. |
| Active Model | The profile currently loaded in llama-server. Queried from the sidecar before each request. |
| Fallback | The LXC container (10.0.4.200) running a fixed model. Pure fallback: no switching, no sidecar. Always-on safety net. |
| Queue | In-memory request buffer held during a model switch. Hard cap: 120 seconds. Drains once sidecar reports ready. |
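
To make the Profile definition concrete, here is a sketch of how a profile's flags might become a llama-server command line. The manifest schema is not shown in this document, so the `Profile` field names and the rule for boolean flags are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

FlagValue = Union[str, int, float, bool]


@dataclass
class Profile:
    # Field names are hypothetical; the real manifest schema may differ.
    name: str
    display_name: str
    model_path: str
    flags: Dict[str, FlagValue] = field(default_factory=dict)


def build_command(profile: Profile, binary: str = "llama-server") -> List[str]:
    """Turn a profile into an argv list; True becomes a bare switch, False is dropped."""
    argv = [binary, "--model", profile.model_path]
    for key, value in profile.flags.items():
        if value is True:
            argv.append(f"--{key}")
        elif value is not False:
            argv.extend([f"--{key}", str(value)])
    return argv
```

Two profiles can point at the same `model_path` with different flags, which is how a single GGUF ends up with multiple profiles.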

Architecture

Hermes (Desktop App)
    ↕ (OpenAI-compatible API)
Intelligence Router (Docker, 10.0.4.100:9001)
    ├─→ Sidecar (Main PC, 10.0.4.11:8081) — model switching, manifest, status
    ├─→ OpenRouter (DeepSeek V4 Flash) — after 3 failed sidecar recoveries
    └─→ Fallback SLM (LXC, 10.0.4.200) — out-of-credits safety net

Decisions

- Manifest over scan: profiles are explicitly listed, not discovered by filesystem walk. Allows multiple configurations per GGUF.
- Flexible flags: each profile carries an arbitrary flags dict. No predetermined set of parameters.
- Stateless routing: the router asks the sidecar for the active model before each request. No local caching of state.
- Cold start: the sidecar starts with no model loaded. The user picks one from the Hermes picker.
- Queue on switch: the first request triggers the switch, subsequent requests queue. Hard cap: 120s.
- SSE feedback: the router injects an `event: model_switching` SSE event so Hermes shows progress instead of a blank spinner.
- LXC as pure fallback: no switching, no sidecar. Out-of-credits safety net.
- Sidecar as systemd service: auto-restart on crash, starts at boot, no default model.
- Circuit breaker: the sidecar auto-restarts llama-server up to 3 times on crash, then the router falls back to OpenRouter.
- Queue cap: max 10 queued requests, 120s hard timeout. Requests beyond capacity receive 429.
- Readiness detection: the sidecar polls `localhost:8080/v1/models` every 500ms. Unblocks the queue on a 200 response.
- Switch lock: an in-memory lock prevents concurrent switches. Subsequent requests join the queue.
- Custom provider in Hermes: the router is registered as custom with `base_url: http://10.0.4.100:9001/v1`. No auth.
- OpenRouter stripped from direct routing: the old `x-intelligence-level: High` rule is removed. OpenRouter is a fallback backend, not a direct routing rule.
- OpenRouter key: stored in the router `.env` as `OPENROUTER_API_KEY`.
- Fallback chain: Main PC → OpenRouter → LXC. Each level is tried only if the previous one fails.
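
The queue and switch-lock decisions can be sketched together with a single `asyncio.Event`. `SwitchGate` is a hypothetical helper, not the router's actual code, and returning 504 on timeout is an assumption (the notes only specify 429 beyond capacity).

```python
import asyncio

MAX_QUEUED = 10        # from the decisions above
QUEUE_TIMEOUT_S = 120  # from the decisions above


class SwitchGate:
    """Holds requests while a switch is in flight (hypothetical helper)."""

    def __init__(self, max_queued: int = MAX_QUEUED, timeout: float = QUEUE_TIMEOUT_S):
        self._ready = asyncio.Event()
        self._ready.set()  # no switch in progress initially
        self._waiting = 0
        self._max_queued = max_queued
        self._timeout = timeout

    def begin_switch(self) -> bool:
        """Return True if the caller may start a switch, False if one is running."""
        if not self._ready.is_set():
            return False
        self._ready.clear()  # no await between check and clear, so this acts as the lock
        return True

    def end_switch(self) -> None:
        self._ready.set()  # releases every waiter at once

    async def wait_until_ready(self) -> int:
        """Return 200 when ready, 429 if the queue is full, 504 on timeout."""
        if self._ready.is_set():
            return 200
        if self._waiting >= self._max_queued:
            return 429
        self._waiting += 1
        try:
            await asyncio.wait_for(self._ready.wait(), timeout=self._timeout)
            return 200
        except asyncio.TimeoutError:
            return 504
        finally:
            self._waiting -= 1
```

The gate should be created inside the running event loop (for example in an application startup hook) so the event is bound to the right loop on older Python versions.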

Implementation Files

| File | Purpose |
|------|---------|
| `main.py` | Router: FastAPI proxy with routing, queue, circuit breaker, fallback chain |
| `sidecar/app.py` | Sidecar: FastAPI service for model management |
| `sidecar/manifest.py` | Sidecar manifest YAML loading and validation |
| `deploy/llm-sidecar.service` | Systemd service unit file for the sidecar |
| `deploy/manifest.yaml` | Example manifest file |
| `deploy/README.md` | Deployment instructions |

API Endpoints

Sidecar (`10.0.4.11:8081`)

GET /models/available — List all manifest profiles
GET /models/status — Current active model status
POST /models/switch — Switch to a different model profile

Router (`10.0.4.100:9001`)

GET /v1/models — OpenAI-compatible model list (proxies from sidecar)
GET /models/status — Proxy to sidecar status
POST /models/switch — Proxy to sidecar switch
GET /health — Router health check
/{path:path} — Smart proxy with automatic switching and fallback
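
As an illustration of the OpenAI-compatible model list, profile names from the sidecar could be wrapped in the standard list-models response shape. The function name, the `owned_by` default, and the `created: 0` placeholder are assumptions; the shape of the sidecar's own response is not shown here.

```python
from typing import Dict, List


def to_openai_model_list(profile_names: List[str], owned_by: str = "local") -> Dict:
    """Wrap profile names in the OpenAI list-models response shape."""
    return {
        "object": "list",
        "data": [
            # created is a placeholder timestamp in this sketch
            {"id": name, "object": "model", "created": 0, "owned_by": owned_by}
            for name in profile_names
        ],
    }
```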
