intelligence-router/docs/prd/0001-model-switching-via-sidecar.md

# PRD — Model Switching via Sidecar

## Problem Statement

The Intelligence Router currently routes LLM requests to a fixed backend (Main PC running a single llama-server instance). When the user wants to switch models — either a different GGUF or different parameters for the same GGUF — they have to manually stop llama-server on the Main PC, start it with the right flags, and wait for it to be ready. The router has no visibility into which model is loaded, what models are available, or whether the backend is ready. Hermes has no way to present a model picker, since `/v1/models` returns whatever llama-server exposes for the currently loaded model. The system needs to support dynamic model switching with zero manual intervention, while maintaining the existing fallback chain.

## Solution

A Sidecar service runs on the Main PC as a systemd service, managing llama-server as a subprocess. The Sidecar reads a YAML manifest of named Profiles from disk and exposes a REST API for listing available models, switching models, and reporting status. The Intelligence Router is refactored to query the Sidecar before each request, trigger a Model Switch when the requested model differs from the Active Model, and queue requests during the switch. Hermes connects to the Router via a custom provider and sees all manifest profiles in its model picker.

## User Stories

1. As a user, I want to see all available models in my Hermes model picker, so that I can choose which model to use before sending a message.
2. As a user, I want to switch models from the Hermes picker without touching my Main PC, so that model changes are seamless.
3. As a user, I want my message to queue while the model switches, so that I don't lose my prompt during cold starts.
4. As a user, I want to see progress feedback while a model is loading, so that I know the system is working instead of staring at a blank spinner.
5. As a user, I want to configure multiple Profiles for the same GGUF with different parameters (e.g., context length, GPU layers), so that I can switch between configurations without editing config files.
6. As a user, I want to add new models to the manifest without restarting the Sidecar, so that the model list stays current.
7. As a user, I want the system to fall back to OpenRouter if the Sidecar fails to recover llama-server after 3 attempts, so that I'm not stuck with no backend.
8. As a user, I want the system to fall back to the LXC if OpenRouter credits are exhausted, so that I always have a working backend.
9. As a user, I want the Sidecar to start automatically at boot, so that the system is ready without manual intervention.
10. As a user, I want the Sidecar to recover automatically if llama-server crashes, so that transient failures don't require me to restart anything.
11. As a user, I want concurrent requests during a switch to queue (up to 10) rather than fail, so that burst traffic during cold starts is handled gracefully.
12. As a user, I want queued requests to time out after 120 seconds, so that I'm not waiting forever if something is fundamentally broken.
13. As a user, I want a 429 response when the queue is full, so that the client knows to stop sending more requests.
14. As a developer, I want the manifest to live on the Main PC where the GGUFs are, so that model paths point at local files and are easy to manage.
15. As a developer, I want the Sidecar API to be simple and stateless (except for the switch lock), so that debugging is straightforward.
16. As a developer, I want the Router to always ask the Sidecar for the Active Model before each request, so that Router state never drifts from reality.
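
Stories 11 to 13 together describe a bounded wait: at most 10 requests may wait for a switch, each for at most 120 seconds, and anything beyond the cap is rejected. A minimal sketch of that behavior with `asyncio` follows. The `SwitchGate` class and both exception names are hypothetical, and the mapping of `QueueFull` to an HTTP 429 would live in the Router's HTTP layer, which is not shown.

```python
import asyncio

MAX_QUEUED = 10          # story 11: queue cap
QUEUE_TIMEOUT_S = 120.0  # story 12: per-request wait limit


class QueueFull(Exception):
    """Raised when the cap is reached; the HTTP layer would answer 429."""


class QueueTimeout(Exception):
    """Raised when a queued request waits longer than the timeout."""


class SwitchGate:
    """Holds requests while a model switch is in progress (hypothetical helper)."""

    def __init__(self, max_queued: int = MAX_QUEUED,
                 timeout_s: float = QUEUE_TIMEOUT_S) -> None:
        self._ready = asyncio.Event()
        self._ready.set()  # no switch in progress at startup
        self._waiting = 0
        self._max_queued = max_queued
        self._timeout_s = timeout_s

    def begin_switch(self) -> None:
        self._ready.clear()

    def end_switch(self) -> None:
        self._ready.set()

    async def wait_until_ready(self) -> None:
        """Return at once if no switch is running, otherwise queue up."""
        if self._ready.is_set():
            return
        if self._waiting >= self._max_queued:
            raise QueueFull()
        self._waiting += 1
        try:
            await asyncio.wait_for(self._ready.wait(), timeout=self._timeout_s)
        except asyncio.TimeoutError:
            raise QueueTimeout() from None
        finally:
            self._waiting -= 1
```

A request handler would call `await gate.wait_until_ready()` before proxying, and the code path that triggers the switch would wrap it in `begin_switch()` and `end_switch()`. An `asyncio.Event` releases all waiters at once when the switch finishes, which seems to match the intent of the stories; strict first-in-first-out ordering would need a different structure.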

## Implementation Decisions

### Modules built/modified

- **Sidecar** (new) — Python service running on Main PC via systemd. Dependencies: `fastapi`, `uvicorn`, `pyyaml`. Exposes `/models/available`, `/models/switch`, `/models/status`. Manages llama-server as a subprocess (start, stop, readiness polling).
- **Router** (modified) — `main.py` refactored: removes direct Main PC routing logic, adds Sidecar client, adds `/v1/models` endpoint that proxies to Sidecar, adds request queue with timeout, adds circuit breaker for Sidecar failures, adds fallback chain (Main PC → OpenRouter → LXC).
- **Manifest** (new) — YAML file at `/home/bigt/AI/llm/manifest.yaml`. Contains named profiles with `model_path`, `name`, and `flags` dict.
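
The Sidecar's readiness polling is not specified further in this document. One possible shape, sketched below with the standard library only, assumes readiness is detected by an HTTP 200 from a health URL on llama-server's port; the `/health` path, the default timeout, and the polling interval are assumptions, and `http_probe` and `wait_until_ready` are hypothetical names. The probe is injectable so the loop can be tested without a running server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Port 8080 comes from this document; the /health path is an assumption.
HEALTH_URL = "http://localhost:8080/health"


def http_probe(url: str = HEALTH_URL, timeout_s: float = 2.0) -> bool:
    """Return True if the server answers the health URL with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503.
        return False


def wait_until_ready(
    probe: Callable[[], bool] = http_probe,
    timeout_s: float = 120.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

In a switch handler, a `False` result would presumably translate into the `{status: "error", message}` response of `/models/switch`.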

### API contracts

- **Sidecar `/models/available` (GET)** — Returns list of profiles from manifest. Each profile: `{id, name, model_path, flags}`.
- **Sidecar `/models/switch` (POST)** — Body: `{profile_id}`. Stops current llama-server, starts new one with profile's flags, polls for readiness, returns `{status: "ready", active_profile}` or `{status: "error", message}`.
- **Sidecar `/models/status` (GET)** — Returns `{active_profile: Profile | null, llama_server_running: bool}`.
- **Router `/v1/models` (GET)** — OpenAI-compatible model list derived from Sidecar manifest. Each model `id` is the profile ID.
- **Router proxy endpoint** — Before routing, queries Sidecar status. If `active_profile` matches requested model → route to Main PC. If mismatch → POST to `/models/switch`, queue request, wait for readiness. If Sidecar unreachable → circuit breaker → fallback chain.
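
The proxy endpoint's three branches can be expressed as a small pure function, which keeps the routing rule testable without any HTTP. This is a sketch under assumptions: `Route` and `decide_route` are hypothetical names, and treating `llama_server_running: false` as "switch needed" is an interpretation this document does not state.

```python
from enum import Enum
from typing import Optional


class Route(Enum):
    MAIN_PC = "main_pc"              # active profile already matches
    SWITCH_THEN_MAIN_PC = "switch"   # POST /models/switch, queue, then route
    FALLBACK = "fallback"            # Sidecar unreachable: breaker + fallback chain


def decide_route(requested_model: str, status: Optional[dict]) -> Route:
    """Map a Sidecar status payload to a routing action.

    `status` is the parsed body of GET /models/status, or None when the
    Sidecar could not be reached.
    """
    if status is None:
        return Route.FALLBACK
    active = status.get("active_profile")
    running = status.get("llama_server_running", False)
    if running and active is not None and active.get("id") == requested_model:
        return Route.MAIN_PC
    # Covers a different active profile, no profile loaded, or a stopped server.
    return Route.SWITCH_THEN_MAIN_PC
```

For example, a request for `qwen-3-8b-long` while `qwen-3-8b` is active yields `Route.SWITCH_THEN_MAIN_PC`.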

### Architectural decisions

- **Sidecar is a systemd service** — Managed by `systemd` on Main PC with `Restart=always`. No default model loaded on start.
- **Router is stateless regarding model state** — Always queries Sidecar. No local cache of which model is active.
- **OpenRouter is no longer selected by the old `x-intelligence-level: High` header** — It stops being a direct routing target and becomes a fallback in the chain.
- **Router port changed to 9001** — `docker-compose.yml` maps `9001:9000` to avoid conflict with existing deployment on 9000.
- **Manifest reloaded on every `/models/available` call** — No file watcher; Sidecar re-reads manifest on each request so changes are immediately visible.
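
With OpenRouter demoted to a fallback, the chain Main PC → OpenRouter → LXC amounts to "try each backend in order and return the first success". A minimal sketch follows; `BackendError`, `call_with_fallback`, and the backend callables are hypothetical, and how each backend signals failure (Sidecar recovery exhausted, credits exhausted) is left to the real clients.

```python
from typing import Callable, List, Sequence, Tuple

# A backend is a name plus a callable that sends the request and returns
# the response, or raises BackendError if it cannot serve it.
Backend = Tuple[str, Callable[[dict], dict]]


class BackendError(Exception):
    """Raised by a backend callable when it cannot serve the request."""


def call_with_fallback(request: dict, chain: Sequence[Backend]) -> Tuple[str, dict]:
    """Return (backend_name, response) from the first backend that succeeds."""
    errors: List[str] = []
    for name, send in chain:
        try:
            return name, send(request)
        except BackendError as exc:
            errors.append(f"{name}: {exc}")
    raise BackendError("all backends failed: " + "; ".join(errors))
```

Returning the backend name alongside the response makes it easy to log which link of the chain actually served a request.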

### Schema — Manifest profile shape

```yaml
- id: qwen-3-8b
  name: "Qwen 3 8B"
  model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
  flags:
    n_ctx: 8192
    n_gpu_layers: 35
- id: qwen-3-8b-long
  name: "Qwen 3 8B (Long Context)"
  model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
  flags:
    n_ctx: 32768
    n_gpu_layers: 20
```
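
This document does not say how a profile's `flags` become llama-server command-line arguments. The sketch below shows one way to validate parsed manifest entries and build the argument list. The `FLAG_TO_OPTION` mapping is an assumption (it maps `n_ctx` to `--ctx-size` and `n_gpu_layers` to `--n-gpu-layers`), and `Profile`, `parse_profiles`, and `build_argv` are hypothetical names. The input is the already-parsed YAML, for example the result of `yaml.safe_load`, so the example needs only the standard library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Assumed mapping from manifest flag names to llama-server options.
FLAG_TO_OPTION = {
    "n_ctx": "--ctx-size",
    "n_gpu_layers": "--n-gpu-layers",
}


@dataclass
class Profile:
    id: str
    name: str
    model_path: str
    flags: Dict[str, Any] = field(default_factory=dict)


def parse_profiles(raw: Optional[List[dict]]) -> List[Profile]:
    """Validate parsed manifest entries and reject duplicate ids.

    An empty YAML file parses to None, which is treated as no profiles.
    """
    profiles: List[Profile] = []
    seen = set()
    for entry in raw or []:
        profile = Profile(
            id=entry["id"],
            name=entry["name"],
            model_path=entry["model_path"],
            flags=dict(entry.get("flags") or {}),
        )
        if profile.id in seen:
            raise ValueError(f"duplicate profile id: {profile.id}")
        seen.add(profile.id)
        profiles.append(profile)
    return profiles


def build_argv(profile: Profile, binary: str = "llama-server",
               port: int = 8080) -> List[str]:
    """Build the subprocess argument list for one profile."""
    argv = [binary, "--model", profile.model_path, "--port", str(port)]
    for key, value in profile.flags.items():
        if key not in FLAG_TO_OPTION:
            raise ValueError(f"unsupported flag in profile {profile.id}: {key}")
        argv += [FLAG_TO_OPTION[key], str(value)]
    return argv
```

Rejecting unknown flag names up front means a typo in the manifest fails at switch time with a clear message instead of being passed to llama-server.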

## Testing Decisions

Tests focus exclusively on external behavior — HTTP contracts and routing logic — not internal implementation details.

- **Sidecar unit tests** — Test manifest parsing, API endpoints, and switch logic with llama-server subprocess mocked via `unittest.mock.patch`. Tests cover: empty manifest, valid manifest, switch to new profile, switch when already on same profile, readiness detection, crash recovery.
- **Router unit tests** — Test routing decisions with Sidecar mocked via `respx` (httpx mocking). Tests cover: active model match routes to Main PC, mismatch triggers switch + queue, Sidecar down triggers circuit breaker + fallback, queue cap enforcement, queue timeout, 429 beyond capacity, `/v1/models` returns manifest profiles.
- **No integration tests against real llama-server** — Too heavy and slow. Real-world validation done via manual smoke test on Main PC.
- **Test framework** — `pytest`, `httpx`, FastAPI `TestClient`. Adds `respx` and `pyyaml` to test dependencies.
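
As an illustration of the subprocess-mocking approach, the sketch below patches `subprocess.Popen` so that no real llama-server is started. `ServerManager` is a hypothetical stand-in for the Sidecar's process handling, not its actual design, and the test is a plain function that `pytest` would collect.

```python
import subprocess
from unittest import mock


class ServerManager:
    """Minimal stand-in for the Sidecar's process handling (hypothetical)."""

    def __init__(self) -> None:
        self.process = None

    def start(self, argv) -> None:
        # A switch always stops the previous server before starting a new one.
        self.stop()
        self.process = subprocess.Popen(argv)

    def stop(self) -> None:
        if self.process is not None:
            self.process.terminate()
            self.process.wait(timeout=10)
            self.process = None


def test_switch_stops_old_process_before_starting_new_one():
    with mock.patch("subprocess.Popen") as popen:
        first, second = mock.Mock(), mock.Mock()
        popen.side_effect = [first, second]  # one fake process per start()
        manager = ServerManager()
        manager.start(["llama-server", "--model", "a.gguf"])
        manager.start(["llama-server", "--model", "b.gguf"])

    first.terminate.assert_called_once()
    second.terminate.assert_not_called()
    assert manager.process is second
```

Because the patch targets `subprocess.Popen` on the module itself, it takes effect for any code that calls `subprocess.Popen(...)` through the module, as `ServerManager.start` does here.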

## Out of Scope

- LXC model switching — LXC remains a static fallback with no Sidecar.
- Authentication on the Sidecar — Runs on Main PC behind firewall; no auth required.
- Multi-user support — Single user, single Sidecar instance.
- WebSocket streaming to Sidecar — Sidecar uses REST only.
- GUI or web admin for the Sidecar — Manifest is edited manually.
- Model download or management — Sidecar does not download or install GGUFs.
- GPU resource monitoring — Sidecar doesn't check VRAM before switching.

## Further Notes

- The manifest path is set through the `MANIFEST_PATH` environment variable in the Sidecar's systemd unit.
- The Sidecar communicates with llama-server via localhost:8080 (same port as existing deployment).
- The Router's circuit breaker tracks Sidecar failures in memory. A Sidecar restart resets the counter.
- The `OPENROUTER_API_KEY` is stored in the Router's `.env` file.
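
A minimal in-memory circuit breaker for the Router could look like the sketch below. The failure threshold, the cool-down period, and the half-open probe are assumptions, since this document only states that failures are tracked in memory; `CircuitBreaker` is a hypothetical class. Here the counter is cleared on the first successful Sidecar call, which is one reading of "a Sidecar restart resets the counter".

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Consecutive-failure counter held in memory (thresholds are assumptions)."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """Return False while the breaker is open and the Sidecar should be skipped."""
        if self._opened_at is None:
            return True
        # After the cool-down, let calls through again to probe the Sidecar.
        return self._clock() - self._opened_at >= self._reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()
```

The Router would call `allow_request()` before contacting the Sidecar and go straight to the fallback chain when it returns `False`.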