intelligence-router/sidecar
root 45dd793b69 fix: sidecar process kill was not awaiting wait() — old server held GPU VRAM
- _kill_llama_server() was a sync function that called process.wait() without awaiting
  it. The call only created a coroutine object that was discarded, so the old
  llama-server was never waited on and could still hold GPU memory when the new one
  started, causing OOM on rapid model switches. Fixed by making the function async:
  send SIGTERM, await wait() with a 10s timeout, then fall back to SIGKILL.

- Changed _switch_lock from threading.Lock to asyncio.Lock to prevent event loop
  deadlock during long switch operations (a blocking threading.Lock acquire stalls
  the loop thread, while asyncio.Lock yields to other tasks).

- Router proxy: only trigger model switches for POST requests to /v1/chat/completions
  and /v1/completions. Other requests (GET probes, /api/show) no longer trigger
  unwanted model reloads.

- _ollama_show_lookup: return active profile context size when model_name is empty.
  Previously returned 404, causing Hermes Desktop to default to 256k context.

- Always call drain_queue() and complete_switch() after a switch failure so queued
  requests don't hang forever waiting on a switching event that is never set.
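
For reference, a minimal sketch of the terminate-then-wait pattern described in the first item. It is not the code in app.py: stop_process() and demo() are hypothetical names, and a sleeping Python child stands in for llama-server. It assumes the server is started through asyncio's subprocess API.

```python
import asyncio
import sys


async def stop_process(proc: asyncio.subprocess.Process, timeout: float = 10.0) -> int:
    """Terminate proc, wait for it to exit, and escalate to kill() on timeout.

    Hypothetical helper for illustration; the timeout mirrors the 10s in the commit.
    """
    if proc.returncode is not None:
        return proc.returncode  # already exited, nothing to do

    try:
        proc.terminate()  # SIGTERM on POSIX
    except ProcessLookupError:
        pass  # process exited between the check above and terminate()

    try:
        # The await matters: a bare proc.wait() only builds a coroutine object
        # and never actually waits for the child to exit.
        await asyncio.wait_for(proc.wait(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()  # SIGKILL on POSIX
        await proc.wait()

    return proc.returncode


async def demo() -> int:
    # Stand-in child process (assumption: any long-running command works here).
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(60)"
    )
    return await stop_process(proc, timeout=10.0)


if __name__ == "__main__":
    print("child exit code:", asyncio.run(demo()))
```

Only after stop_process() returns is it safe to assume the child has exited, which is the point at which a replacement server could be started.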
2026-06-17 23:49:57 +00:00
__init__.py Epic: Model Switching via Sidecar — Issues #2-#3 2026-06-15 00:49:24 +00:00
app.py fix: sidecar process kill was not awaiting wait() — old server held GPU VRAM 2026-06-17 23:49:57 +00:00
manifest.py Epic: Model Switching via Sidecar — Issues #2-#3 2026-06-15 00:49:24 +00:00