intelligence-router/CONTEXT.md
2026-06-15 00:09:31 +00:00

Intelligence Router — Context & Glossary

Terminology

Term Definition
Router The FastAPI proxy running in Docker (10.0.4.100:9001). Intercepts LLM requests, asks the sidecar for the active model, and routes accordingly.
Sidecar A lightweight Python service running on the Main PC via systemd. Manages the llama-server subprocess and serves manifest/profile data.
Profile A named model configuration from the manifest. Contains a model path, display name, and arbitrary llama-server flags. A single GGUF can have multiple profiles.
Manifest A YAML file on the Main PC (/home/bigt/AI/llm/manifest.yaml) that lists all available profiles. Source of truth for what models Hermes sees.
Model Switch The destructive handoff process: stop current llama-server, start new one with chosen profile's flags, wait for readiness.
Active Model The profile currently loaded in llama-server. Queried from the sidecar before each request.
Fallback The LXC container (10.0.4.200) running a fixed model. Pure fallback — no switching, no sidecar. Always-on safety net.
Queue In-memory request buffer held during a model switch. Hard caps: 10 queued requests, 120 seconds. Drains once the sidecar reports ready.
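
To make the Profile and Manifest terms concrete, here is a minimal Python sketch of how one profile could be turned into a llama-server command line. The field names (name, model_path, display_name, flags), the build_command helper, the placeholder GGUF path, and the flag-to-argv convention are all assumptions for illustration; the real manifest schema is not shown in this document. The example flags (ctx-size, n-gpu-layers) are ordinary llama-server options chosen only as samples.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """One manifest entry. Field names are hypothetical, not the real schema."""
    name: str
    model_path: str
    display_name: str
    flags: dict = field(default_factory=dict)  # arbitrary llama-server flags


def build_command(profile, binary="llama-server"):
    """Translate a profile into an argv list (assumed convention: --key value)."""
    argv = [binary, "--model", profile.model_path]
    for key, value in profile.flags.items():
        if value is True:
            argv.append(f"--{key}")            # bare boolean switch
        elif value is False or value is None:
            continue                           # disabled flag: emit nothing
        else:
            argv.extend([f"--{key}", str(value)])
    return argv


# Two profiles sharing one GGUF file (placeholder path), differing only in flags.
GGUF = "/models/example.gguf"
fast = Profile("example-fast", GGUF, "Example (fast)", {"ctx-size": 4096})
long_ctx = Profile(
    "example-long", GGUF, "Example (long context)",
    {"ctx-size": 32768, "n-gpu-layers": 99},
)

print(build_command(fast))
```

Because flags is an open dict, adding a new llama-server option to a profile would need no code change under this convention.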

Architecture

Hermes (Desktop App)
    ↕ (OpenAI-compatible API)
Intelligence Router (Docker, 10.0.4.100:9001)
    ├─→ Sidecar (Main PC, 10.0.4.11) — model switching, manifest, status
    ├─→ OpenRouter (DeepSeek V4 Flash) — after 3 failed sidecar recoveries
    └─→ Fallback SLM (LXC, 10.0.4.200) — out-of-credits safety net
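
The backend order in the diagram can be read as a small decision function. The sketch below is one possible reading, not the router's actual code: the names Backend and choose_backend are hypothetical, and how the router learns that OpenRouter credits are exhausted is not specified here, so it is passed in as a plain boolean.

```python
from enum import Enum

MAX_RECOVERIES = 3  # sidecar recovery attempts before leaving the Main PC path


class Backend(Enum):
    SIDECAR = "sidecar"        # Main PC, 10.0.4.11
    OPENROUTER = "openrouter"  # cloud fallback
    FALLBACK = "fallback"      # LXC, 10.0.4.200


def choose_backend(failed_recoveries, openrouter_has_credits):
    """Pick a backend from the number of failed sidecar recoveries and credit state."""
    if failed_recoveries < MAX_RECOVERIES:
        return Backend.SIDECAR
    if openrouter_has_credits:
        return Backend.OPENROUTER
    return Backend.FALLBACK    # out-of-credits safety net


print(choose_backend(0, True))   # Backend.SIDECAR
print(choose_backend(3, True))   # Backend.OPENROUTER
print(choose_backend(3, False))  # Backend.FALLBACK
```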

Decisions

  • Manifest over scan — profiles explicitly listed, not discovered by filesystem walk. Allows multiple configurations per GGUF.
  • Flexible flags — each profile carries an arbitrary flags dict. No predetermined set of parameters.
  • Stateless routing — router always asks the sidecar for the active model before each request. No local caching of state.
  • Cold start — sidecar starts with no model loaded. User picks from Hermes picker.
  • Queue on switch — first request triggers switch, subsequent requests queue. Hard cap: 120s.
  • SSE feedback — router injects an SSE event (event: model_switching) during a switch so Hermes shows progress instead of a blank spinner.
  • LXC as pure fallback — no switching, no sidecar. Out-of-credits safety net.
  • Sidecar as systemd service — auto-restart on crash, starts at boot, no default model.
  • Circuit breaker — sidecar auto-restarts llama-server up to 3 times on crash, then router falls back to OpenRouter.
  • Queue cap — max 10 queued requests, 120s hard timeout. Requests beyond capacity receive HTTP 429.
  • Readiness detection — sidecar polls localhost:8080/v1/models every 500ms. Unblocks queue on 200.
  • Switch lock — in-memory lock prevents concurrent switches. Subsequent requests join queue.
  • Custom provider in Hermes — router registered as custom with base_url: http://10.0.4.100:9001/v1. No auth.
  • OpenRouter stripped from direct routing — the old x-intelligence-level: High rule has been removed. OpenRouter is now a fallback backend only, not a direct routing target.
  • OpenRouter key — stored in router .env as OPENROUTER_API_KEY.
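
The "Queue on switch", "Switch lock", and "Queue cap" decisions combine into one gate: the first request performs the switch, later requests wait for it, and anything beyond the cap is rejected. The asyncio sketch below shows that shape under stated assumptions. SwitchGate, QueueFull, and the enter method are hypothetical names, and the real router may structure this differently.

```python
import asyncio


class QueueFull(Exception):
    """Raised when the queue is at capacity; the router would answer HTTP 429."""


class SwitchGate:
    def __init__(self, max_queued=10, timeout_s=120.0):
        self.max_queued = max_queued
        self.timeout_s = timeout_s
        self._switching = False   # in-memory switch lock
        self._ready = None        # asyncio.Event, set when the switch ends
        self._queued = 0

    async def enter(self, do_switch):
        """Call when a request needs a switch. Returns 'switched' or 'queued'."""
        if not self._switching:
            # No await between the check and the set, so this is atomic
            # within a single event loop.
            self._switching = True
            self._ready = asyncio.Event()
            try:
                await do_switch()
            finally:
                self._switching = False
                self._ready.set()  # drain everyone waiting
            return "switched"
        if self._queued >= self.max_queued:
            raise QueueFull()
        self._queued += 1
        try:
            # Raises TimeoutError if the switch outlives the hard cap.
            await asyncio.wait_for(self._ready.wait(), timeout=self.timeout_s)
        finally:
            self._queued -= 1
        return "queued"


async def demo():
    gate = SwitchGate(max_queued=2, timeout_s=1.0)  # small cap for the demo

    async def fake_switch():
        await asyncio.sleep(0.05)  # stands in for stop/start/wait-for-ready

    tasks = [asyncio.create_task(gate.enter(fake_switch)) for _ in range(4)]
    return await asyncio.gather(*tasks, return_exceptions=True)


results = asyncio.run(demo())
print(results)
```

In the demo, the first task runs the switch, the next two queue, and the fourth exceeds the demo cap of 2 and gets QueueFull.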
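
The "Readiness detection" decision describes a simple poll loop. A minimal sketch follows, using only the Python standard library. The function names, the injectable probe and sleep parameters (added for testability), and the 120-second deadline are assumptions; the document gives only the URL, the 500ms interval, and the 200 status.

```python
import time
import urllib.error
import urllib.request

READY_URL = "http://localhost:8080/v1/models"
POLL_INTERVAL_S = 0.5


def http_status(url, timeout=2.0):
    """Return the HTTP status for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code   # server answered, but with an error status
    except OSError:
        return None       # e.g. connection refused while llama-server starts


def wait_until_ready(probe=http_status, url=READY_URL,
                     interval_s=POLL_INTERVAL_S, deadline_s=120.0,
                     sleep=time.sleep):
    """Poll until probe(url) returns 200. The deadline is approximate:
    it counts sleep time only, not time spent inside the probe."""
    waited = 0.0
    while waited <= deadline_s:
        if probe(url) == 200:
            return True
        sleep(interval_s)
        waited += interval_s
    return False


# Simulated startup: unreachable, then 503 while loading, then ready.
statuses = iter([None, 503, 200])
naps = []
ready = wait_until_ready(probe=lambda url: next(statuses), sleep=naps.append)
print(ready, naps)
```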