intelligence-router/CONTEXT.md

# Intelligence Router — Context & Glossary

## Terminology

| Term | Definition |
|------|------------|
| **Router** | The FastAPI proxy running in Docker (10.0.4.100:9001). Intercepts LLM requests, checks active model, and routes accordingly. |
| **Sidecar** | A lightweight Python service running on the Main PC via systemd. Manages the llama-server subprocess and serves manifest/profile data. |
| **Profile** | A named model configuration from the manifest. Contains a model path, display name, and arbitrary llama-server flags. A single GGUF can have multiple profiles. |
| **Manifest** | A YAML file on the Main PC (`/home/bigt/AI/llm/manifest.yaml`) that lists all available profiles. The source of truth for which models Hermes sees. |
| **Model Switch** | The destructive handoff process: stop current llama-server, start new one with chosen profile's flags, wait for readiness. |
| **Active Model** | The profile currently loaded in llama-server. Queried from the sidecar before each request. |
| **Fallback** | The LXC container (10.0.4.200) running a fixed model. Pure fallback — no switching, no sidecar. Always-on safety net. |
| **Queue** | In-memory request buffer held during a model switch. Capped at 10 requests and 120 seconds. Drains once the sidecar reports ready. |
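
To make the Profile and Manifest terms concrete, here is a minimal sketch of how a profile's `flags` dict might be turned into a llama-server command line. The field names (`name`, `model_path`, `display_name`), the `build_command` helper, and the example paths are assumptions for illustration; the real manifest schema lives in `sidecar/manifest.py` and may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """One manifest entry. Field names are assumed, not the real schema."""
    name: str
    model_path: str
    display_name: str
    flags: dict = field(default_factory=dict)


def build_command(profile: Profile, binary: str = "llama-server") -> list[str]:
    """Turn a profile into an argv list for the llama-server subprocess."""
    argv = [binary, "--model", profile.model_path]
    for key, value in profile.flags.items():
        flag = f"--{key}"
        if value is True:
            argv.append(flag)                # boolean switch, no value
        elif value is False or value is None:
            continue                         # disabled switch, omitted
        else:
            argv.extend([flag, str(value)])  # key/value option
    return argv


# Two hypothetical profiles sharing one GGUF file but differing in flags.
fast = Profile("demo-fast", "/models/demo.gguf", "Demo (fast)",
               {"ctx-size": 4096})
long = Profile("demo-long", "/models/demo.gguf", "Demo (long context)",
               {"ctx-size": 32768, "mlock": True})

print(build_command(fast))
print(build_command(long))
```

Because the flags are passed through untouched, a new llama-server option can be used from the manifest without a code change in the sidecar, which is the point of keeping them arbitrary.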

## Architecture

```
Hermes (Desktop App)
    ↕ (OpenAI-compatible API)
Intelligence Router (Docker, 10.0.4.100:9001)
    ├─→ Sidecar (Main PC, 10.0.4.11:8081) — model switching, manifest, status
    ├─→ OpenRouter (DeepSeek V4 Flash) — after 3 failed sidecar recoveries
    └─→ Fallback SLM (LXC, 10.0.4.200) — out-of-credits safety net
```
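
The backend order in the diagram can be sketched as a small selection function. This is an illustration under stated assumptions, not the code in `main.py`: the backend labels, the `CircuitBreaker` class, and the injected `is_healthy` callable are all hypothetical names, and the limit of 3 comes from the "3 failed sidecar recoveries" note above.

```python
from typing import Callable, Optional, Sequence

# Order taken from the diagram; the labels themselves are made up.
CHAIN = ("main_pc", "openrouter", "lxc")
MAX_RECOVERY_ATTEMPTS = 3


class CircuitBreaker:
    """Counts consecutive Main PC failures and opens at the limit."""

    def __init__(self, limit: int = MAX_RECOVERY_ATTEMPTS) -> None:
        self.limit = limit
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1

    def record_success(self) -> None:
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.limit


def pick_backend(
    breaker: CircuitBreaker,
    is_healthy: Callable[[str], bool],
    chain: Sequence[str] = CHAIN,
) -> Optional[str]:
    """Return the first usable backend, or None if every backend is down."""
    for name in chain:
        if name == "main_pc" and breaker.is_open:
            continue  # skip the Main PC while the circuit is open
        if is_healthy(name):
            return name
    return None  # the caller would turn this into an error response


breaker = CircuitBreaker()
print(pick_backend(breaker, lambda name: True))   # main_pc
for _ in range(3):
    breaker.record_failure()
print(pick_backend(breaker, lambda name: True))   # openrouter
```

Passing the health check in as a callable keeps the ordering logic testable without any network access.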

## Decisions

- **Manifest over scan** — profiles explicitly listed, not discovered by filesystem walk. Allows multiple configurations per GGUF.
- **Flexible flags** — each profile carries an arbitrary `flags` dict. No predetermined set of parameters.
- **Stateless routing** — router always asks the sidecar for the active model before each request. No local caching of state.
- **Cold start** — sidecar starts with no model loaded. User picks from Hermes picker.
- **Queue on switch** — first request triggers switch, subsequent requests queue. Hard cap: 120s.
- **SSE feedback** — router injects `event: model_switching` SSE event so Hermes shows progress instead of a blank spinner.
- **LXC as pure fallback** — no switching, no sidecar. Out-of-credits safety net.
- **Sidecar as systemd service** — auto-restart on crash, starts at boot, no default model.
- **Circuit breaker** — sidecar auto-restarts llama-server up to 3 times on crash, then router falls back to OpenRouter.
- **Queue cap** — max 10 queued requests, 120s hard timeout. `429` beyond capacity.
- **Readiness detection** — sidecar polls `localhost:8080/v1/models` every 500ms. Unblocks queue on `200`.
- **Switch lock** — in-memory lock prevents concurrent switches. Subsequent requests join queue.
- **Custom provider in Hermes** — router registered as `custom` with `base_url: http://10.0.4.100:9001/v1`. No auth.
- **OpenRouter stripped from direct routing** — the old `x-intelligence-level: High` header is no longer honored. OpenRouter is a fallback backend, not a direct routing rule.
- **OpenRouter key** — stored in router `.env` as `OPENROUTER_API_KEY`.
- **Fallback chain** — Main PC → OpenRouter → LXC. Each level is tried only if the previous one fails.
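
The switch lock, queue cap, and 120s timeout described in the list above could fit together roughly as follows. This is a minimal asyncio sketch, not the router's actual code: `SwitchGate`, `QueueFull`, and `demo` are hypothetical names, and the demo uses a smaller cap and timeout so it finishes quickly.

```python
import asyncio

MAX_QUEUED = 10           # from the "Queue cap" decision
SWITCH_TIMEOUT_S = 120.0  # from the "Queue on switch" decision


class QueueFull(Exception):
    """Queue is at capacity; a router could map this to a 429."""


class SwitchGate:
    def __init__(self, max_queued=MAX_QUEUED, timeout=SWITCH_TIMEOUT_S):
        self._lock = asyncio.Lock()    # one switch at a time
        self._ready = asyncio.Event()  # set while no switch is running
        self._ready.set()
        self._waiting = 0
        self.max_queued = max_queued
        self.timeout = timeout

    async def switch(self, do_switch):
        """Run do_switch() under the lock, holding other requests back."""
        async with self._lock:
            self._ready.clear()
            try:
                await asyncio.wait_for(do_switch(), self.timeout)
            finally:
                # Release waiters even on failure so they can fall back.
                self._ready.set()

    async def wait_until_ready(self):
        """Queue behind a running switch, or return at once if idle."""
        if self._ready.is_set():
            return
        if self._waiting >= self.max_queued:
            raise QueueFull
        self._waiting += 1
        try:
            await asyncio.wait_for(self._ready.wait(), self.timeout)
        finally:
            self._waiting -= 1


async def demo():
    gate = SwitchGate(max_queued=2, timeout=1.0)

    async def fake_switch():
        await asyncio.sleep(0.05)  # stands in for restarting llama-server

    switching = asyncio.create_task(gate.switch(fake_switch))
    await asyncio.sleep(0)  # let the switch start and close the gate
    waiters = [asyncio.create_task(gate.wait_until_ready()) for _ in range(3)]
    results = await asyncio.gather(*waiters, return_exceptions=True)
    await switching
    return [type(r).__name__ for r in results]


print(asyncio.run(demo()))  # ['NoneType', 'NoneType', 'QueueFull']
```

In the demo the third waiter exceeds the cap of two and is rejected immediately, while the first two are released as soon as the simulated switch completes.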

## Implementation Files

| File | Purpose |
|------|---------|
| `main.py` | Router — FastAPI proxy with routing, queue, circuit breaker, fallback chain |
| `sidecar/app.py` | Sidecar — FastAPI service for model management |
| `sidecar/manifest.py` | Sidecar manifest YAML loading and validation |
| `deploy/llm-sidecar.service` | Systemd service unit file for the sidecar |
| `deploy/manifest.yaml` | Example manifest file |
| `deploy/README.md` | Deployment instructions |

## API Endpoints

### Sidecar (`10.0.4.11:8081`)
- `GET /models/available` — List all manifest profiles
- `GET /models/status` — Current active model status
- `POST /models/switch` — Switch to a different model profile
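
As a rough illustration of calling these endpoints, the sketch below builds the requests with the standard library only. The request and response bodies are not specified in this document, so the JSON field name `profile` and the assumption that responses are JSON are both hypothetical; check `sidecar/app.py` for the real contract.

```python
import json
import urllib.request

SIDECAR = "http://10.0.4.11:8081"  # sidecar address from the heading above


def status_request() -> urllib.request.Request:
    """Build a GET request for the current active model status."""
    return urllib.request.Request(f"{SIDECAR}/models/status", method="GET")


def switch_request(profile_name: str) -> urllib.request.Request:
    """Build a POST request asking the sidecar to switch profiles."""
    # ASSUMPTION: the body shape {"profile": ...} is made up for this sketch.
    body = json.dumps({"profile": profile_name}).encode("utf-8")
    return urllib.request.Request(
        f"{SIDECAR}/models/switch",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 5.0):
    """Send a request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = switch_request("demo-fast")
print(req.get_method(), req.full_url)
```

Only `send` touches the network, so the request builders can be exercised on a machine that cannot reach the sidecar.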

### Router (`10.0.4.100:9001`)
- `GET /v1/models` — OpenAI-compatible model list (proxies from sidecar)
- `GET /models/status` — Proxy to sidecar status
- `POST /models/switch` — Proxy to sidecar switch
- `GET /health` — Router health check
- `/{path:path}` — Smart proxy with automatic switching and fallback
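
When the smart proxy triggers a switch, it emits the `model_switching` SSE event noted under Decisions. A minimal sketch of the wire format is shown below. The helper names, the payload fields (`model`, `phase`), and the phase names are assumptions for illustration; only the event name comes from this document.

```python
import json
from typing import Iterator, Sequence


def sse_format(event: str, data: dict) -> str:
    """Serialize one Server-Sent Event: event line, data line, blank line."""
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def switch_progress(
    model: str,
    phases: Sequence[str] = ("stopping", "starting", "ready"),
) -> Iterator[str]:
    """Yield one model_switching event per (hypothetical) phase."""
    for phase in phases:
        yield sse_format("model_switching", {"model": model, "phase": phase})


for chunk in switch_progress("demo-fast"):
    print(chunk, end="")
```

Each event ends with a blank line, which is what tells an SSE client that the event is complete and can be dispatched.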