- _kill_llama_server() was a sync function calling a coroutine without awaiting it. process.wait() only created a coroutine object that was then discarded, so the old llama-server was never waited on to release GPU memory before a new one started, causing OOM on rapid model switches. Fixed with async await + 10s SIGTERM timeout + SIGKILL fallback.
- Changed _switch_lock from threading.Lock to asyncio.Lock() to prevent event loop deadlock during long switch operations.
- Router proxy: only trigger model switches for POST /v1/chat/completions and /v1/completions. Non-chat endpoints (GET probes, /api/show) no longer trigger unwanted model reloads.
- _ollama_show_lookup: return the active profile's context size when model_name is empty. Previously it returned 404, causing Hermes Desktop to default to a 256k context.
- Always call drain_queue() + complete_switch() after a switch failure so queued requests don't hang forever waiting on a switching event that is never set.

| | | |
|---|---|---|
| .hermes/plans | ||
| deploy | ||
| docs | ||
| scripts | ||
| sidecar | ||
| tests | ||
| .env | ||
| .gitignore | ||
| CONTEXT.md | ||
| docker-compose.yml | ||
| Dockerfile | ||
| main.py | ||
| pytest.ini | ||
| requirements.txt | ||
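
The first fix in the commit message (awaiting the old server's exit, with a SIGTERM timeout and a SIGKILL fallback) can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: `stop_process` and `demo` are hypothetical names, the child process is a stand-in for llama-server, and it assumes the process was started through `asyncio.create_subprocess_exec`.

```python
import asyncio
import sys


async def stop_process(proc: asyncio.subprocess.Process, timeout: float = 10.0) -> int:
    """Ask proc to exit, wait up to `timeout` seconds, then force-kill it.

    Hypothetical helper; the real _kill_llama_server() may differ.
    """
    if proc.returncode is not None:
        return proc.returncode  # already exited, nothing to do

    proc.terminate()  # SIGTERM on POSIX
    try:
        # The await is the point of the fix: a bare proc.wait() only builds
        # a coroutine object and never actually waits for the process.
        return await asyncio.wait_for(proc.wait(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()  # SIGKILL on POSIX
        return await proc.wait()


async def demo() -> int:
    # Stand-in child that would otherwise sleep for a minute.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(60)"
    )
    return await stop_process(proc, timeout=10.0)
```

Calling `asyncio.run(demo())` returns only after the child has exited, which is the property the fix relies on: the next server is not started until the old process is gone and its GPU memory can be reclaimed.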