Issue #4: Automatic model detection and switch - Router extracts model from chat body, queries sidecar, triggers switch on mismatch - Matching active model routes directly to Main PC - No active model triggers cold start switch - Tests: 4 test_router_model_detection.py Issue #5: SSE switch progress feedback - _sse_format() correctly serializes SSE events - sse_progress_stream() generates phase progression events - Proxy yields SSE events then actual response - Tests: 3 test_router_sse_progress.py Issue #6: Circuit breaker + OpenRouter fallback - Circuit tracks Sidecar failures, opens after MAX_RECOVERY_ATTEMPTS (3) - OpenRouter API key from env, no longer uses x-intelligence-level header - Fixes: OPENROUTER_BASE, SSE format, circuit state isolation - Tests: 7 test_router_circuit_breaker.py Issue #7: LXC fallback chain completion - Full fallback: Main PC → OpenRouter → LXC - Each backend health-checked via /v1/models before routing - All backends down → 503 response - Fixed: execute() wrapped in try/except to trigger fallback chain - Tests: 3 test_router_fallback_lxc.py Issue #8: Systemd service deployment - deploy/llm-sidecar.service: systemd unit with Restart=always - deploy/manifest.yaml: example manifest with 3 profiles - deploy/README.md: deployment instructions - Updated: docker-compose.yml, requirements.txt, Dockerfile Test framework improvements: - tests/conftest.py: shared URL patches for all router tests - Fixed global state pollution in circuit breaker tests - Fixed test sidecar switch test (AsyncMock for async function) Total: 42 tests passing
4.3 KiB
4.3 KiB
Intelligence Router — Context & Glossary
Terminology
| Term | Definition |
|---|---|
| Router | The FastAPI proxy running in Docker (10.0.4.100:9001). Intercepts LLM requests, checks active model, and routes accordingly. |
| Sidecar | A lightweight Python service running on the Main PC via systemd. Manages the llama-server subprocess and serves manifest/profile data. |
| Profile | A named model configuration from the manifest. Contains a model path, display name, and arbitrary llama-server flags. A single GGUF can have multiple profiles. |
| Manifest | A YAML file on the Main PC (/home/bigt/AI/llm/manifest.yaml) that lists all available profiles. Source of truth for what models Hermes sees. |
| Model Switch | The destructive handoff process: stop current llama-server, start new one with chosen profile's flags, wait for readiness. |
| Active Model | The profile currently loaded in llama-server. Queried from the sidecar before each request. |
| Fallback | The LXC container (10.0.4.200) running a fixed model. Pure fallback — no switching, no sidecar. Always-on safety net. |
| Queue | In-memory request buffer held during a model switch. Hard cap: 120 seconds. Drains once sidecar reports ready. |
Architecture
Hermes (Desktop App)
↕ (OpenAI-compatible API)
Intelligence Router (Docker, 10.0.4.100:9001)
├─→ Sidecar (Main PC, 10.0.4.11:8081) — model switching, manifest, status
├─→ OpenRouter (DeepSeek V4 Flash) — after 3 failed sidecar recoveries
└─→ Fallback SLM (LXC, 10.0.4.200) — out-of-credits safety net
Decisions
- Manifest over scan — profiles explicitly listed, not discovered by filesystem walk. Allows multiple configurations per GGUF.
- Flexible flags — each profile carries an arbitrary
flagsdict. No predetermined set of parameters. - Stateless routing — router always asks the sidecar for the active model before each request. No local caching of state.
- Cold start — sidecar starts with no model loaded. User picks from Hermes picker.
- Queue on switch — first request triggers switch, subsequent requests queue. Hard cap: 120s.
- SSE feedback — router injects
event: model_switchingSSE event so Hermes shows progress instead of a blank spinner. - LXC as pure fallback — no switching, no sidecar. Out-of-credits safety net.
- Sidecar as systemd service — auto-restart on crash, starts at boot, no default model.
- Circuit breaker — sidecar auto-restarts llama-server up to 3 times on crash, then router falls back to OpenRouter.
- Queue cap — max 10 queued requests, 120s hard timeout.
429beyond capacity. - Readiness detection — sidecar polls
localhost:8080/v1/modelsevery 500ms. Unblocks queue on200. - Switch lock — in-memory lock prevents concurrent switches. Subsequent requests join queue.
- Custom provider in Hermes — router registered as
customwithbase_url: http://10.0.4.100:9001/v1. No auth. - OpenRouter stripped from direct routing — old
x-intelligence-level: Highremoved. OpenRouter is a fallback backend, not a direct routing rule. - OpenRouter key — stored in router
.envasOPENROUTER_API_KEY. - Fallback chain: Main PC → OpenRouter → LXC. Each level tried only if the previous fails.
Implementation Files
| File | Purpose |
|---|---|
main.py |
Router — FastAPI proxy with routing, queue, circuit breaker, fallback chain |
sidecar/app.py |
Sidecar — FastAPI service for model management |
sidecar/manifest.py |
Sidecar manifest YAML loading and validation |
deploy/llm-sidecar.service |
Systemd service unit file for the sidecar |
deploy/manifest.yaml |
Example manifest file |
deploy/README.md |
Deployment instructions |
API Endpoints
Sidecar (10.0.4.11:8081)
GET /models/available— List all manifest profilesGET /models/status— Current active model statusPOST /models/switch— Switch to a different model profile
Router (10.0.4.100:9001)
GET /v1/models— OpenAI-compatible model list (proxies from sidecar)GET /models/status— Proxy to sidecar statusPOST /models/switch— Proxy to sidecar switchGET /health— Router health check/{path:path}— Smart proxy with automatic switching and fallback