Added next changes

2026-06-15 00:09:31 +00:00 · 2026-06-15 00:09:31 +00:00 · b2031d8b7a
commit b2031d8b7a
parent 712fe041b1
3 changed files with 143 additions and 0 deletions
--- a/CONTEXT.md
+++ b/CONTEXT.md
@ -0,0 +1,43 @@
+# Intelligence Router — Context & Glossary
+
+## Terminology
+
+| Term | Definition |
+|------|------------|
+| **Router** | The FastAPI proxy running in Docker (10.0.4.100:9001). Intercepts LLM requests, checks active model, and routes accordingly. |
+| **Sidecar** | A lightweight Python service running on the Main PC via systemd. Manages the llama-server subprocess and serves manifest/profile data. |
+| **Profile** | A named model configuration from the manifest. Contains a model path, display name, and arbitrary llama-server flags. A single GGUF can have multiple profiles. |
+| **Manifest** | A YAML file on the Main PC (`/home/bigt/AI/llm/manifest.yaml`) that lists all available profiles. The source of truth for which models Hermes sees. |
+| **Model Switch** | The destructive handoff process: stop current llama-server, start new one with chosen profile's flags, wait for readiness. |
+| **Active Model** | The profile currently loaded in llama-server. Queried from the sidecar before each request. |
+| **Fallback** | The LXC container (10.0.4.200) running a fixed model. Pure fallback — no switching, no sidecar. Always-on safety net. |
+| **Queue** | In-memory request buffer held during a model switch. Caps: 10 queued requests, 120 seconds. Drains once the sidecar reports ready. |
+
+## Architecture
+
+```
+Hermes (Desktop App)
+    ↕ (OpenAI-compatible API)
+Intelligence Router (Docker, 10.0.4.100:9001)
+    ├─→ Sidecar (Main PC, 10.0.4.11) — model switching, manifest, status
+    ├─→ OpenRouter (DeepSeek V4 Flash) — after 3 failed sidecar recoveries
+    └─→ Fallback SLM (LXC, 10.0.4.200) — out-of-credits safety net
+```
+
+## Decisions
+
+- **Manifest over scan** — profiles explicitly listed, not discovered by filesystem walk. Allows multiple configurations per GGUF.
+- **Flexible flags** — each profile carries an arbitrary `flags` dict. No predetermined set of parameters.
+- **Stateless routing** — router always asks the sidecar for the active model before each request. No local caching of state.
+- **Cold start** — sidecar starts with no model loaded. User picks from Hermes picker.
+- **Queue on switch** — first request triggers switch, subsequent requests queue. Hard cap: 120s.
+- **SSE feedback** — router injects `event: model_switching` SSE event so Hermes shows progress instead of a blank spinner.
+- **LXC as pure fallback** — no switching, no sidecar. Out-of-credits safety net.
+- **Sidecar as systemd service** — auto-restart on crash, starts at boot, no default model.
+- **Circuit breaker** — sidecar auto-restarts llama-server up to 3 times on crash, then router falls back to OpenRouter.
+- **Queue cap** — max 10 queued requests, 120s hard timeout. `429` beyond capacity.
+- **Readiness detection** — sidecar polls `localhost:8080/v1/models` every 500ms. Unblocks queue on `200`.
+- **Switch lock** — in-memory lock prevents concurrent switches. Subsequent requests join queue.
+- **Custom provider in Hermes** — router registered as `custom` with `base_url: http://10.0.4.100:9001/v1`. No auth.
+- **OpenRouter stripped from direct routing** — old `x-intelligence-level: High` removed. OpenRouter is a fallback backend, not a direct routing rule.
+- **OpenRouter key** — stored in router `.env` as `OPENROUTER_API_KEY`.
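
The queue-on-switch, queue-cap, and switch-lock decisions above could fit together roughly as follows. This is a minimal sketch, not the router's actual code: `SwitchGate`, `QueueFull`, and the method names are hypothetical, and only the two limits (10 requests, 120 seconds) come from the text.

```python
import asyncio

MAX_QUEUED = 10        # queue cap from the decisions above
QUEUE_TIMEOUT_S = 120  # hard timeout from the decisions above


class QueueFull(Exception):
    """Raised at capacity; a router would map this to HTTP 429."""


class SwitchGate:
    """Hypothetical helper that holds requests while a model switch runs."""

    def __init__(self, max_queued=MAX_QUEUED, timeout_s=QUEUE_TIMEOUT_S):
        self._ready = asyncio.Event()
        self._ready.set()            # no switch in progress at start
        self._lock = asyncio.Lock()  # prevents concurrent switches
        self._waiting = 0
        self._max_queued = max_queued
        self._timeout_s = timeout_s

    async def switch(self, do_switch):
        """Run the async callable `do_switch` under the switch lock."""
        async with self._lock:
            self._ready.clear()
            try:
                return await do_switch()
            finally:
                self._ready.set()    # release waiters on success or failure

    async def wait_until_ready(self):
        """Hold a request until the switch finishes, or time out."""
        if self._ready.is_set():
            return
        if self._waiting >= self._max_queued:
            raise QueueFull()
        self._waiting += 1
        try:
            await asyncio.wait_for(self._ready.wait(), timeout=self._timeout_s)
        finally:
            self._waiting -= 1
```

A request handler would call `wait_until_ready()` before proxying, translating `QueueFull` into `429` and `asyncio.TimeoutError` into whatever timeout response the router chooses.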
--- a/docs/adr/0001-model-management-via-sidecar.md
+++ b/docs/adr/0001-model-management-via-sidecar.md
@ -0,0 +1,5 @@
+# Model management via sidecar with manifest profiles
+
+The router no longer manages models directly. A sidecar service on the Main PC manages llama-server as a subprocess, using a YAML manifest of named profiles. Each profile carries arbitrary llama-server flags — no predetermined schema. The router queries the sidecar for active model state before every request and queues requests (max 10, 120s cap) during switches. Fallback chain: Main PC → OpenRouter (DeepSeek V4 Flash) → LXC. Circuit breaker on sidecar: 3 auto-recover attempts before falling back.
+
+**Why this way:** Listing profiles in a manifest, rather than scanning the filesystem, allows multiple configurations per GGUF. A flexible flags dict avoids rigid parameter schemas that don't match real model diversity. Stateless routing (always ask the sidecar) prevents drift between the router's view and the actual state. Using profile IDs as model identifiers (not GGUF filenames) resolves the ambiguity when one GGUF has multiple configs. Queue-with-timeout gives a smoother UX during cold starts than dropping user messages.
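
The fallback chain and the three-attempt circuit breaker described in this ADR might be modelled as below. This is a sketch under assumptions: the class and function names are hypothetical, and `openrouter_has_credits` stands in for however the router detects exhausted credits, which the ADR does not specify.

```python
from dataclasses import dataclass

MAX_RECOVERY_ATTEMPTS = 3  # from the ADR: 3 auto-recover attempts


@dataclass
class CircuitBreaker:
    """In-memory failure counter; opens after `max_failures` failures."""
    max_failures: int = MAX_RECOVERY_ATTEMPTS
    failures: int = 0

    def record_failure(self) -> None:
        self.failures += 1

    def record_success(self) -> None:
        self.failures = 0  # a healthy sidecar resets the counter

    @property
    def is_open(self) -> bool:
        return self.failures >= self.max_failures


def choose_backend(breaker: CircuitBreaker, openrouter_has_credits: bool) -> str:
    """Walk the chain Main PC -> OpenRouter -> LXC and return the first usable hop."""
    if not breaker.is_open:
        return "main_pc"
    if openrouter_has_credits:  # hypothetical signal, not defined in the ADR
        return "openrouter"
    return "lxc"
```

Keeping the chain in one pure function makes the ordering easy to test without any network calls.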
--- a/docs/prd/0001-model-switching-via-sidecar.md
+++ b/docs/prd/0001-model-switching-via-sidecar.md
@ -0,0 +1,95 @@
+# PRD — Model Switching via Sidecar
+
+## Problem Statement
+
+The Intelligence Router currently routes LLM requests to a fixed backend (Main PC running a single llama-server instance). When the user wants to switch models — either a different GGUF or different parameters for the same GGUF — they have to manually stop llama-server on the Main PC, start it with the right flags, and wait for it to be ready. The router has no visibility into which model is loaded, what models are available, or whether the backend is ready. Hermes has no way to present a model picker, since `/v1/models` returns whatever llama-server exposes for the currently loaded model. The system needs to support dynamic model switching with zero manual intervention, while maintaining the existing fallback chain.
+
+## Solution
+
+A Sidecar service runs on the Main PC as a systemd service, managing llama-server as a subprocess. The Sidecar reads a YAML manifest of named Profiles from disk and exposes a REST API for listing available models, switching models, and reporting status. The Intelligence Router is refactored to query the Sidecar before each request, trigger a Model Switch when the requested model differs from the Active Model, and queue requests during the switch. Hermes connects to the Router via a custom provider and sees all manifest profiles in its model picker.
+
+## User Stories
+
+1. As a user, I want to see all available models in my Hermes model picker, so that I can choose which model to use before sending a message.
+2. As a user, I want to switch models from the Hermes picker without touching my Main PC, so that model changes are seamless.
+3. As a user, I want my message to queue while the model switches, so that I don't lose my prompt during cold starts.
+4. As a user, I want to see progress feedback while a model is loading, so that I know the system is working instead of staring at a blank spinner.
+5. As a user, I want to configure multiple Profiles for the same GGUF with different parameters (e.g., context length, GPU layers), so that I can switch between configurations without editing config files.
+6. As a user, I want to add new models to the manifest without restarting the Sidecar, so that the model list stays current.
+7. As a user, I want the system to fall back to OpenRouter if the Sidecar fails to recover llama-server after 3 attempts, so that I'm not stuck with no backend.
+8. As a user, I want the system to fall back to the LXC if OpenRouter credits are exhausted, so that I always have a working backend.
+9. As a user, I want the Sidecar to start automatically at boot, so that the system is ready without manual intervention.
+10. As a user, I want the Sidecar to recover automatically if llama-server crashes, so that transient failures don't require me to restart anything.
+11. As a user, I want concurrent requests during a switch to queue (up to 10) rather than fail, so that burst traffic during cold starts is handled gracefully.
+12. As a user, I want queued requests to time out after 120 seconds, so that I'm not waiting forever if something is fundamentally broken.
+13. As a user, I want a 429 response when the queue is full, so that the client knows to stop sending more requests.
+14. As a developer, I want the manifest to live on the Main PC where the GGUFs are, so that model paths are local to that machine and easy to manage.
+15. As a developer, I want the Sidecar API to be simple and stateless (except for the switch lock), so that debugging is straightforward.
+16. As a developer, I want the Router to always ask the Sidecar for the Active Model before each request, so that Router state never drifts from reality.
+
+## Implementation Decisions
+
+### Modules built/modified
+
+- **Sidecar** (new) — Python service running on Main PC via systemd. Dependencies: `fastapi`, `uvicorn`, `pyyaml`. Exposes `/models/available`, `/models/switch`, `/models/status`. Manages llama-server as a subprocess (start, stop, readiness polling).
+- **Router** (modified) — `main.py` refactored: removes direct Main PC routing logic, adds Sidecar client, adds `/v1/models` endpoint that proxies to Sidecar, adds request queue with timeout, adds circuit breaker for Sidecar failures, adds fallback chain (Main PC → OpenRouter → LXC).
+- **Manifest** (new) — YAML file at `/home/bigt/AI/llm/manifest.yaml`. Contains named profiles with `model_path`, `name`, and `flags` dict.
+
+### API contracts
+
+- **Sidecar `/models/available` (GET)** — Returns list of profiles from manifest. Each profile: `{id, name, model_path, flags}`.
+- **Sidecar `/models/switch` (POST)** — Body: `{profile_id}`. Stops current llama-server, starts new one with profile's flags, polls for readiness, returns `{status: "ready", active_profile}` or `{status: "error", message}`.
+- **Sidecar `/models/status` (GET)** — Returns `{active_profile: Profile | null, llama_server_running: bool}`.
+- **Router `/v1/models` (GET)** — OpenAI-compatible model list derived from Sidecar manifest. Each model `id` is the profile ID.
+- **Router proxy endpoint** — Before routing, queries Sidecar status. If `active_profile` matches requested model → route to Main PC. If mismatch → POST to `/models/switch`, queue request, wait for readiness. If Sidecar unreachable → circuit breaker → fallback chain.
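
The three branches of the Router proxy contract above can be expressed as a small pure function. This is an illustrative sketch: `Action` and `decide` are hypothetical names, `status` is assumed to be the parsed JSON body of `/models/status` (or `None` when the Sidecar is unreachable), and treating `llama_server_running: false` as a mismatch is an assumption not stated in the contract.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    ROUTE = "route"        # forward to Main PC
    SWITCH = "switch"      # POST /models/switch, then queue the request
    FALLBACK = "fallback"  # circuit breaker, then fallback chain


def decide(requested_model: str, status: Optional[dict]) -> Action:
    """Map a Sidecar status response to one of the three routing actions."""
    if status is None:
        return Action.FALLBACK
    active = status.get("active_profile")
    running = status.get("llama_server_running", False)
    # Assumption: a matching profile only counts if llama-server is running.
    if active is not None and running and active.get("id") == requested_model:
        return Action.ROUTE
    return Action.SWITCH
```

For example, `decide("qwen-3-8b", {"active_profile": None, "llama_server_running": False})` yields `Action.SWITCH`, which matches the cold-start case where no model is loaded yet.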
+
+### Architectural decisions
+
+- **Sidecar is a systemd service** — Managed by `systemd` on Main PC with `Restart=always`. No default model loaded on start.
+- **Router is stateless regarding model state** — Always queries Sidecar. No local cache of which model is active.
+- **Old `x-intelligence-level: High` header removed** — OpenRouter is no longer a direct routing target; it is now only a fallback in the chain.
+- **Router port changed to 9001** — `docker-compose.yml` maps `9001:9000` to avoid conflict with existing deployment on 9000.
+- **Manifest reloaded on every `/models/available` call** — No file watcher; Sidecar re-reads manifest on each request so changes are immediately visible.
+
+### Schema — Manifest profile shape
+
+```yaml
+- id: qwen-3-8b
+  name: "Qwen 3 8B"
+  model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
+  flags:
+    n_ctx: 8192
+    n_gpu_layers: 35
+- id: qwen-3-8b-long
+  name: "Qwen 3 8B (Long Context)"
+  model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
+  flags:
+    n_ctx: 32768
+    n_gpu_layers: 20
+```
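
One way the Sidecar could validate entries of this shape after loading the YAML is sketched below. It assumes the manifest has already been parsed into a list of dicts (for instance with PyYAML's `yaml.safe_load`); the `Profile` dataclass, `parse_profiles`, and the choice to reject duplicate IDs are illustrative assumptions, not part of the PRD.

```python
from dataclasses import dataclass, field

REQUIRED_KEYS = ("id", "name", "model_path")


@dataclass(frozen=True)
class Profile:
    id: str
    name: str
    model_path: str
    flags: dict = field(default_factory=dict)  # arbitrary, no fixed schema


def parse_profiles(raw):
    """Turn parsed manifest entries into a dict of Profile keyed by id."""
    profiles = {}
    for entry in raw or []:  # an empty manifest parses to None
        missing = [k for k in REQUIRED_KEYS if k not in entry]
        if missing:
            raise ValueError(f"profile {entry.get('id', '?')!r} is missing {missing}")
        if entry["id"] in profiles:
            raise ValueError(f"duplicate profile id {entry['id']!r}")
        profiles[entry["id"]] = Profile(
            id=entry["id"],
            name=entry["name"],
            model_path=entry["model_path"],
            flags=dict(entry.get("flags") or {}),
        )
    return profiles
```

Keying by `id` rather than `model_path` keeps both example profiles distinct even though they point at the same GGUF.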
+
+## Testing Decisions
+
+Tests focus exclusively on external behavior — HTTP contracts and routing logic — not internal implementation details.
+
+- **Sidecar unit tests** — Test manifest parsing, API endpoints, and switch logic with llama-server subprocess mocked via `unittest.mock.patch`. Tests cover: empty manifest, valid manifest, switch to new profile, switch when already on same profile, readiness detection, crash recovery.
+- **Router unit tests** — Test routing decisions with Sidecar mocked via `respx` (httpx mocking). Tests cover: active model match routes to Main PC, mismatch triggers switch + queue, Sidecar down triggers circuit breaker + fallback, queue cap enforcement, queue timeout, 429 beyond capacity, `/v1/models` returns manifest profiles.
+- **No integration tests against real llama-server** — Too heavy and slow. Real-world validation done via manual smoke test on Main PC.
+- **Test framework** — `pytest`, `httpx`, FastAPI `TestClient`. Adds `respx` and `pyyaml` to test dependencies.
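
The readiness-detection case listed above could be tested without a real llama-server if the polling loop takes its probe as a parameter. The sketch below is one possible shape: `wait_until_ready`, its parameters, and the 120-second default are assumptions; only the 500 ms interval and the "unblock on `200`" rule come from the design notes.

```python
import time


def wait_until_ready(probe, timeout_s=120.0, interval_s=0.5,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `probe()` until it returns 200 or the timeout elapses.

    `probe` is any callable returning an HTTP status code; in production it
    would issue a GET against llama-server's /v1/models endpoint.
    Returns True when ready, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe() == 200:
                return True
        except OSError:
            pass  # e.g. connection refused while llama-server is starting
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

In a test, a fake probe can replay a scripted sequence, and `sleep` can be replaced with a recorder so the test runs instantly:

```python
responses = iter([ConnectionRefusedError(), 503, 200])

def fake_probe():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item

slept = []
assert wait_until_ready(fake_probe, sleep=slept.append) is True
```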
+
+## Out of Scope
+
+- LXC model switching — LXC remains a static fallback with no Sidecar.
+- Authentication on the Sidecar — Runs on Main PC behind firewall; no auth required.
+- Multi-user support — Single user, single Sidecar instance.
+- WebSocket streaming to Sidecar — Sidecar uses REST only.
+- GUI or web admin for the Sidecar — Manifest is edited manually.
+- Model download or management — Sidecar does not download or install GGUFs.
+- GPU resource monitoring — Sidecar doesn't check VRAM before switching.
+
+## Further Notes
+
+- The manifest path is set through the `MANIFEST_PATH` environment variable in the Sidecar's systemd unit.
+- The Sidecar communicates with llama-server via `localhost:8080` (same port as the existing deployment).
+- The Router's circuit breaker tracks Sidecar failures in memory. A Sidecar restart resets the counter.
+- The `OPENROUTER_API_KEY` is stored in the Router's `.env` file.