The FIRST request that triggers a model switch was blocking the HTTP response
for 10-30s while waiting for the sidecar to load the model. Hermes Desktop's
client timed out during this wait, causing 'nothing happens' on new session.
Fix: refactored the proxy handler so ALL requests during a model switch use
the same SSE streaming pattern (immediate 200, progress events, then actual
response piped through after switch completes). The switch now runs as a
background asyncio task via create_task().
- Added _background_switch() — runs POST /models/switch in background task
with complete_switch() + drain_queue() in finally block
- All switch-triggering requests go through queue_request() + StreamingResponse
- SSE generator now falls through to OpenRouter/LXC if Main PC unreachable
(switch failure case) instead of hanging indefinitely
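The streaming pattern described above can be sketched roughly as follows. This is a minimal illustration using only asyncio, not the router's actual handler: stream_during_switch(), sse_event(), the phase names, and the do_switch/forward/fallback callables are hypothetical stand-ins for _background_switch(), the primary proxy call, and the OpenRouter/LXC fallback.

```python
import asyncio
import json
from typing import AsyncIterator, Awaitable, Callable


def sse_event(event: str, data: dict) -> str:
    """Serialize one Server-Sent Events frame (hypothetical helper)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def stream_during_switch(
    do_switch: Callable[[], Awaitable[bool]],
    forward: Callable[[], Awaitable[str]],
    fallback: Callable[[], Awaitable[str]],
    poll_interval: float = 0.5,
) -> AsyncIterator[str]:
    """Yield progress events immediately, then the real response."""
    # Start the switch in the background so the stream can begin at once.
    task = asyncio.create_task(do_switch())
    yield sse_event("progress", {"phase": "switching"})
    while not task.done():
        # asyncio.wait() does not cancel the task when the timeout expires.
        await asyncio.wait({task}, timeout=poll_interval)
        if not task.done():
            yield sse_event("progress", {"phase": "loading"})
    try:
        ok = task.result()
    except Exception:
        ok = False
    # On switch failure, fall through to the fallback instead of hanging.
    yield await (forward() if ok else fallback())
```

In a FastAPI handler, a generator like this would typically be wrapped in a StreamingResponse with media type text/event-stream, so the client sees the 200 status before the switch finishes.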
Sidecar fixes from previous commit:
- _kill_llama_server() is now async with proper await on process termination
- _switch_lock changed from threading.Lock to asyncio.Lock()
- _kill_llama_server() was a sync function that called process.wait() without awaiting it,
creating a discarded coroutine object. The old llama-server was therefore never waited on
to release GPU memory before a new one started, causing OOM on rapid model switches.
Fixed by making it async: await termination with a 10s SIGTERM timeout, then SIGKILL as fallback.
- Changed _switch_lock from threading.Lock to asyncio.Lock() to prevent event loop
deadlock during long switch operations.
- Router proxy: only trigger model switches for POST /v1/chat/completions and
/v1/completions. Non-chat endpoints (GET probes, /api/show) no longer trigger
unwanted model reloads.
- _ollama_show_lookup: return active profile context size when model_name is empty.
Previously returned 404, causing Hermes Desktop to default to 256k context.
- Always drain_queue() + complete_switch() after switch failure so queued requests
don't hang forever waiting on a never-set switching event.
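The process termination fix in this commit can be illustrated with a short sketch. It assumes the llama-server child is an asyncio.subprocess.Process; stop_process() is a hypothetical name, not the sidecar's _kill_llama_server() itself.

```python
import asyncio


async def stop_process(
    proc: asyncio.subprocess.Process, term_timeout: float = 10.0
) -> int:
    """Terminate a child process and wait until it has actually exited."""
    if proc.returncode is not None:
        return proc.returncode  # already gone
    proc.terminate()  # SIGTERM on POSIX
    try:
        # Awaiting wait() is the point of the fix: calling proc.wait()
        # without await only builds a coroutine object and waits for nothing.
        return await asyncio.wait_for(proc.wait(), timeout=term_timeout)
    except asyncio.TimeoutError:
        proc.kill()  # SIGKILL fallback if SIGTERM was ignored
        return await proc.wait()
```

Only after stop_process() returns is it safe to assume the old process has exited, which is what a new llama-server start needs before it claims GPU memory.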
The circuit breaker opened after MAX_RECOVERY_ATTEMPTS failures but
was never reset because the sidecar status query (which calls
circuit_reset()) was skipped when the circuit was open. This caused
a permanent deadlock: all subsequent requests went to the LXC fallback
with no recovery possible.
Fix: always query the sidecar for /models/status, even when the
circuit is open. If the sidecar responds successfully, reset the
circuit. The circuit breaker now only prevents the SWITCH operation,
not the status health check. If a model is already running when the
circuit is open, route to it directly.
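One reading of this fix, as a sketch: the status query always runs, a successful response resets the circuit, and a model that is already running is used directly when the circuit had been open. The Circuit class, choose_route(), and the "active_model" status key are hypothetical; only MAX_RECOVERY_ATTEMPTS and its value of 3 come from the changelog.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_RECOVERY_ATTEMPTS = 3  # value stated elsewhere in this changelog


@dataclass
class Circuit:
    failures: int = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= MAX_RECOVERY_ATTEMPTS

    def record_failure(self) -> None:
        self.failures += 1

    def reset(self) -> None:
        self.failures = 0


def choose_route(
    circuit: Circuit,
    query_status: Callable[[], Optional[dict]],
    requested_model: str,
) -> str:
    """Return 'primary', 'switch', or 'fallback' for one request."""
    was_open = circuit.is_open
    try:
        status = query_status()  # always runs, even when the circuit is open
    except Exception:
        status = None
    if status is None:
        return "fallback"  # sidecar unreachable
    circuit.reset()  # a successful status response closes the circuit
    active = status.get("active_model")
    if active == requested_model:
        return "primary"
    if was_open and active:
        return "primary"  # a model is already running: skip the switch
    return "switch"
```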
When the primary request to llama-server (10.0.4.11:8081) raises an
exception (connection refused, timeout), it was silently swallowed by
the catch-all except block, making it look like a sidecar/switch
failure when it was actually a network-level error.
Now prints: 'PROXY EXCEPTION on primary <url>: <ExceptionType>: <msg>'
Two changes to debug the fallback-to-LXC issue:
1. Added debug logging on switch failure: prints the profile name,
sidecar response status, and error message. Also calls
circuit_record_failure() so subsequent requests don't wait the
full 120-second timeout before falling back.
2. Fixed scoping bug: sidecar_status was only defined inside the
else branch of the circuit breaker check. Initialized to None
at function scope alongside target_url and error to prevent
NameError when circuit is open.
Three changes to debug and fix Hermes Desktop integration:
1. /api/show: Added GET handler alongside existing POST handler.
Hermes Desktop probes with GET ?model=xxx, not POST body.
Refactored shared lookup logic into _ollama_show_lookup().
2. /v1 root: Added handler returning basic info. Hermes Desktop
probes this URL and ERR_CONNECTION_REFUSED was blocking
full provider validation.
3. Proxy execute(): Added debug logging for non-200 responses.
Prints the backend URL, status code, and first 500 bytes of body
to help diagnose why llama-server returns 400 on
/v1/chat/completions.
Hermes Desktop reads the context size from /api/show's 'parameters'
field. This was hardcoded to 'num_ctx 4096' for every model, causing
'context too small' errors when the user's system prompt + conversation
exceeded 4K tokens.
Now extracts the actual ctx-size from the profile's flags and returns
the correct value (e.g. 'num_ctx 131072' for the 128K profiles).
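A minimal sketch of the extraction, assuming the profile's flags are stored as a list of llama-server command-line arguments. The function names and the flag layout are assumptions; the manifest may store flags differently.

```python
from typing import Sequence

DEFAULT_CTX = 4096  # the old hardcoded value, kept here only as a last resort


def ctx_size_from_flags(flags: Sequence[str], default: int = DEFAULT_CTX) -> int:
    """Find the context size in a list of llama-server style flags."""
    for i, flag in enumerate(flags):
        if flag in ("--ctx-size", "-c") and i + 1 < len(flags):
            value = flags[i + 1]  # "--ctx-size 131072" form
        elif flag.startswith("--ctx-size="):
            value = flag.split("=", 1)[1]  # "--ctx-size=131072" form
        else:
            continue
        if value.isdigit():
            return int(value)
    return default


def show_parameters(flags: Sequence[str]) -> str:
    """Build the 'parameters' string returned by /api/show."""
    return f"num_ctx {ctx_size_from_flags(flags)}"
```

For example, show_parameters(["--ctx-size", "131072"]) yields 'num_ctx 131072'.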
Hermes Desktop sends probe requests to validate providers before allowing
model switching. The router was returning 503 for all of these because
the catch-all proxy requires a 'model' field in the request body.
Added explicit handlers for:
- GET /v1/models/{model_id} — OpenAI single-model lookup
- GET /api/tags — Ollama model list discovery
- POST /api/show — Ollama model info
- GET /api/v1/models — Ollama-compatible model list
- GET /v1/props, GET /props — llama.cpp server properties
- GET /version — llama.cpp version
Also fixed the catch-all proxy to route requests with no model body to
the currently active backend instead of returning 503.
The sidecar is deployed on port 8080 instead of 8081. Update all:
- Default SIDECAR_PORT in sidecar/app.py
- Default SIDECAR_URL in main.py (router)
- deploy/llm-sidecar.service Environment
- deploy/README.md (.env example + config table)
- All 7 test files (conftest, circuit-breaker, fallback, queue,
model-detection, sse-progress, v1-models)
Issue #4: Automatic model detection and switch
- Router extracts model from chat body, queries sidecar, triggers switch on mismatch
- Matching active model routes directly to Main PC
- No active model triggers cold start switch
- Tests: 4 in test_router_model_detection.py
Issue #5: SSE switch progress feedback
- _sse_format() correctly serializes SSE events
- sse_progress_stream() generates phase progression events
- Proxy yields SSE events then actual response
- Tests: 3 in test_router_sse_progress.py
Issue #6: Circuit breaker + OpenRouter fallback
- Circuit tracks Sidecar failures, opens after MAX_RECOVERY_ATTEMPTS (3)
- OpenRouter API key from env, no longer uses x-intelligence-level header
- Fixes: OPENROUTER_BASE, SSE format, circuit state isolation
- Tests: 7 in test_router_circuit_breaker.py
Issue #7: LXC fallback chain completion
- Full fallback: Main PC → OpenRouter → LXC
- Each backend health-checked via /v1/models before routing
- All backends down → 503 response
- Fixed: execute() wrapped in try/except to trigger fallback chain
- Tests: 3 in test_router_fallback_lxc.py
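The fallback chain from Issue #7 could look roughly like this. It is a sketch under assumptions: route_with_fallback() and the (name, healthy, execute) tuple shape are hypothetical, and the health check stands in for the /v1/models probe.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, Tuple

# (name, health check, request executor); all three are illustrative.
Backend = Tuple[str, Callable[[], Awaitable[bool]], Callable[[], Awaitable[str]]]


async def route_with_fallback(backends: Sequence[Backend]) -> Tuple[int, str]:
    """Try each backend in order; return (status, body)."""
    for _name, healthy, execute in backends:
        try:
            if not await healthy():
                continue  # health check failed, try the next backend
            return 200, await execute()
        except Exception:
            # An exception from the health check or from execute()
            # triggers the next backend instead of failing the request.
            continue
    return 503, "all backends unavailable"
```

The order of the list encodes the chain, so Main PC → OpenRouter → LXC is simply the order in which the backends are passed in.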
Issue #8: Systemd service deployment
- deploy/llm-sidecar.service: systemd unit with Restart=always
- deploy/manifest.yaml: example manifest with 3 profiles
- deploy/README.md: deployment instructions
- Updated: docker-compose.yml, requirements.txt, Dockerfile
Test framework improvements:
- tests/conftest.py: shared URL patches for all router tests
- Fixed global state pollution in circuit breaker tests
- Fixed the sidecar switch test (AsyncMock for the async function)
Total: 42 tests passing
Issue #2: Manifest schema + Sidecar foundation
- sidecar/manifest.py: YAML manifest loading and profile validation
- sidecar/app.py: FastAPI sidecar service with /models/available, /models/status endpoints
- Router GET /v1/models: proxies to sidecar, returns OpenAI-compatible model list
- Tests: 12 manifest tests, 6 sidecar endpoint tests, 3 router tests (21 total)
Issue #3: Sidecar model switch + Router request queue
- Sidecar POST /models/switch: stops current llama-server, starts new one, polls for readiness
- Switch lock prevents concurrent switches (threading.Lock for TestClient compatibility)
- Router request queue: max 10 requests, 120s hard timeout, 429 when full
- Router automatic model detection: extracts model from chat body, matches against sidecar status
- Full proxy endpoint with Sidecar → Main PC routing and fallback chain
- Tests: 5 sidecar switch tests, 4 queue tests, 3 router integration tests (12 total)
Total: 33 tests, all passing
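The router request queue from Issue #3 can be sketched as below. This is an illustration, not the router's queue_request(): the SwitchQueue class and the string results are hypothetical, and only the limits (10 requests, 120s, 429 when full) come from the changelog.

```python
import asyncio

MAX_QUEUE = 10           # max queued requests, per the changelog
QUEUE_TIMEOUT_S = 120.0  # hard timeout, per the changelog


class SwitchQueue:
    """Holds requests while a model switch is in progress.

    Create it inside a running event loop so the Event binds to that loop
    on older Python versions.
    """

    def __init__(self, max_size: int = MAX_QUEUE) -> None:
        self.max_size = max_size
        self.waiting = 0
        self.switch_done = asyncio.Event()

    async def wait_for_switch(self, timeout: float = QUEUE_TIMEOUT_S) -> str:
        """Return 'ready', 'full' (router answers 429), or 'timeout'."""
        if self.waiting >= self.max_size:
            return "full"
        self.waiting += 1
        try:
            await asyncio.wait_for(self.switch_done.wait(), timeout)
            return "ready"
        except asyncio.TimeoutError:
            return "timeout"
        finally:
            self.waiting -= 1

    def complete_switch(self) -> None:
        """Release every queued request; call this on success and on failure."""
        self.switch_done.set()
```

Calling complete_switch() in a finally block, as later commits do, is what keeps queued requests from waiting the full timeout when a switch fails.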