Commit Graph

36 Commits

Author SHA1 Message Date
root
b3ac21b2c0 fix: first request no longer blocks on model switch — uses background task + SSE
The first request that triggered a model switch blocked the HTTP response
for 10-30s while waiting for the sidecar to load the model. Hermes Desktop's
client timed out during this wait, so a new session appeared to do nothing.

Fix: refactored the proxy handler so ALL requests during a model switch use
the same SSE streaming pattern (immediate 200, progress events, then actual
response piped through after switch completes). The switch now runs as a
background asyncio task via create_task().

- Added _background_switch() — runs POST /models/switch in background task
  with complete_switch() + drain_queue() in finally block
- All switch-triggering requests go through queue_request() + StreamingResponse
- SSE generator now falls through to OpenRouter/LXC if Main PC unreachable
  (switch failure case) instead of hanging indefinitely

Sidecar fixes from previous commit:
- _kill_llama_server() is now async with proper await on process termination
- _switch_lock changed from threading.Lock to asyncio.Lock()
2026-06-18 00:10:48 +00:00
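
The pattern this commit describes can be sketched as follows. This is a minimal illustration, not the router's actual handler: `fake_switch`, the event names, the payloads, and the poll interval are assumptions standing in for the real POST /models/switch call and progress phases. The switch runs as a background task, and the generator yields SSE frames at once instead of blocking until the model is loaded.

```python
import asyncio
import json


def sse_format(event: str, data: dict) -> str:
    """Serialize one Server-Sent Event frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def fake_switch(done: asyncio.Event) -> None:
    """Hypothetical stand-in for the sidecar's POST /models/switch call."""
    try:
        await asyncio.sleep(0.05)  # pretend the model is loading
    finally:
        done.set()  # always release waiters, even if the switch fails


async def switch_stream(poll_interval: float = 0.01):
    """Yield a progress frame immediately, then a final frame once the switch ends."""
    done = asyncio.Event()
    task = asyncio.create_task(fake_switch(done))  # runs in the background
    yield sse_format("progress", {"phase": "switching"})
    while not done.is_set():
        await asyncio.sleep(poll_interval)
        yield sse_format("progress", {"phase": "loading"})
    await task  # surface any exception raised by the background task
    yield sse_format("done", {"status": "ready"})


async def collect() -> list:
    """Drain the generator, as a streaming response would."""
    return [frame async for frame in switch_stream()]
```

In a FastAPI handler, a generator like `switch_stream()` would typically be wrapped in a `StreamingResponse` with media type `text/event-stream`, so the client receives a 200 before the switch completes.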
root
45dd793b69 fix: sidecar process kill was not awaiting wait() — old server held GPU VRAM
- _kill_llama_server() was a sync function that called process.wait() without
  awaiting it. The call only created a coroutine object that was discarded, so the
  old llama-server was never waited on to release GPU memory before the new one
  started, causing OOM on rapid model switches.
  Fixed with async/await, a 10s SIGTERM timeout, and a SIGKILL fallback.

- Changed _switch_lock from threading.Lock to asyncio.Lock() to prevent event loop
  deadlock during long switch operations.

- Router proxy: only trigger model switches for POST /v1/chat/completions and
  /v1/completions. Non-chat endpoints (GET probes, /api/show) no longer trigger
  unwanted model reloads.

- _ollama_show_lookup: return active profile context size when model_name is empty.
  Previously returned 404, causing Hermes Desktop to default to 256k context.

- Always drain_queue() + complete_switch() after switch failure so queued requests
  don't hang forever waiting on a never-set switching event.
2026-06-17 23:49:57 +00:00
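
A minimal sketch of the termination fix, assuming the server was started with `asyncio.create_subprocess_exec`. The helper name `stop_process` and the sleeping child process are illustrative and not taken from the sidecar code.

```python
import asyncio
import sys


async def stop_process(proc: asyncio.subprocess.Process, timeout: float = 10.0) -> int:
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL."""
    if proc.returncode is not None:
        return proc.returncode  # already exited
    proc.terminate()
    try:
        # The await is the point of the fix: a bare proc.wait() call only
        # creates a coroutine object and never waits for the process to exit.
        await asyncio.wait_for(proc.wait(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
    return proc.returncode


async def demo() -> int:
    # A sleeping Python child stands in for llama-server here.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(60)"
    )
    return await stop_process(proc, timeout=5.0)
```

Only after `stop_process` returns is it safe to assume the old process has exited and its resources can be reclaimed by the next server.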
root
7e9b3f43e1 fix: circuit breaker deadlock — always query sidecar for status
The circuit breaker opened after MAX_RECOVERY_ATTEMPTS failures but
was never reset because the sidecar status query (which calls
circuit_reset()) was skipped when the circuit was open.  This caused
a permanent deadlock: all subsequent requests went to the LXC fallback
with no recovery possible.

Fix: always query the sidecar for /models/status, even when the
circuit is open.  If the sidecar responds successfully, reset the
circuit.  The circuit breaker now only prevents the SWITCH operation,
not the status health check.  If a model is already running when the
circuit is open, route to it directly.
2026-06-16 22:09:16 +00:00
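
The routing decision described above can be sketched as a small pure function. This is an assumption-laden illustration: the `Circuit` class, the `plan_route` name, the `"active"` status field, and the returned labels are hypothetical. Only MAX_RECOVERY_ATTEMPTS and its value of 3 come from the commit history.

```python
from dataclasses import dataclass
from typing import Optional

MAX_RECOVERY_ATTEMPTS = 3  # value quoted in the Issue #6 notes


@dataclass
class Circuit:
    failures: int = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= MAX_RECOVERY_ATTEMPTS

    def record_failure(self) -> None:
        self.failures += 1

    def reset(self) -> None:
        self.failures = 0


def plan_route(circuit: Circuit, status: Optional[dict], wanted: str) -> str:
    """Decide the next step. `status` is the sidecar's /models/status reply,
    or None when the sidecar did not answer. It is queried on every request,
    whether or not the circuit is open."""
    if status is not None:
        circuit.reset()  # a successful status reply closes the circuit
    if status is not None and status.get("active") == wanted:
        return "main_pc"  # the model is already running: route directly
    if circuit.is_open:
        return "fallback"  # an open circuit blocks only the switch
    return "switch"
```

Because the status query happens before the open-circuit check, a recovered sidecar can close the circuit again, which is what breaks the deadlock.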
root
bcf45129f1 fix: add --host 0.0.0.0 to llama-server command
llama-server defaults to binding on 127.0.0.1 (localhost only).
When the router runs on a separate Docker host (10.0.4.100), all
chat completion requests fail with:

  PROXY EXCEPTION on primary http://10.0.4.11:8081/v1/chat/completions:
    ConnectError: All connection attempts failed

Added --host 0.0.0.0 after --port so llama-server listens on all
network interfaces, reachable from the Docker host.
2026-06-16 21:46:07 +00:00
root
75248741e7 fix: log exceptions on primary proxy target
When the primary request to llama-server (10.0.4.11:8081) raises an
exception (connection refused, timeout), it was silently swallowed by
the catch-all except block, making it look like a sidecar/switch
failure when it was actually a network-level error.

Now prints: 'PROXY EXCEPTION on primary <url>: <ExceptionType>: <msg>'
2026-06-16 21:32:36 +00:00
root
5c1753dfef fix: log sidecar switch failures + fix scoping bug in proxy handler
Two changes to debug the fallback-to-LXC issue:

1. Added debug logging on switch failure: prints the profile name,
   sidecar response status, and error message. Also calls
   circuit_record_failure() so subsequent requests don't wait the
   full 120-second timeout before falling back.

2. Fixed scoping bug: sidecar_status was only defined inside the
   else branch of the circuit breaker check. Initialized to None
   at function scope alongside target_url and error to prevent
   NameError when circuit is open.
2026-06-16 21:25:42 +00:00
root
f2e62f60e6 fix: /api/show GET support, /v1 root handler, and proxy debug logging
Three changes to debug and fix Hermes Desktop integration:

1. /api/show: Added GET handler alongside existing POST handler.
   Hermes Desktop probes with GET ?model=xxx, not POST body.
   Refactored shared lookup logic into _ollama_show_lookup().

2. /v1 root: Added handler returning basic info. Hermes Desktop
   probes this URL and ERR_CONNECTION_REFUSED was blocking
   full provider validation.

3. Proxy execute(): Added debug logging for non-200 responses.
   Prints the backend URL, status code, and first 500 bytes of body
   to help diagnose why llama-server returns 400 on
   /v1/chat/completions.
2026-06-16 21:16:45 +00:00
root
d935339280 fix: report actual profile context size in /api/show probe endpoint
Hermes Desktop reads the context size from /api/show's 'parameters'
field.  This was hardcoded to 'num_ctx 4096' for every model, causing
'context too small' errors when the user's system prompt + conversation
exceeded 4K tokens.

Now extracts the actual ctx-size from the profile's flags and returns
the correct value (e.g. 'num_ctx 131072' for the 128K profiles).
2026-06-16 21:04:40 +00:00
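
A minimal sketch of deriving the 'parameters' value from a profile instead of hardcoding it. The function name `show_parameters`, the shape of the `flags` dict, and the 4096 default for profiles without a context flag are assumptions.

```python
def show_parameters(flags: dict, default_ctx: int = 4096) -> str:
    """Build the Ollama-style 'parameters' string from a profile's flags."""
    # Accept the current flag name and the legacy n_ctx spelling.
    ctx = flags.get("ctx-size", flags.get("n_ctx", default_ctx))
    return f"num_ctx {int(ctx)}"
```

For a 128K profile, `show_parameters({"ctx-size": 131072})` gives `'num_ctx 131072'`.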
root
4ee85972ec fix: convert underscores to hyphens in llama-server flag names, fix n_ctx→ctx-size rename
Two changes to fix 'error: invalid argument: --n-ctx' during model switch:

1. sidecar/app.py: Added _flag_key() converter that normalises
   underscores to hyphens in flag names and handles the n_ctx→ctx-size
   rename. The code now converts e.g. n_gpu_layers → n-gpu-layers,
   top_p → top-p, top_k → top-k, min_p → min-p before passing to
   llama-server CLI.

2. deploy/manifest.yaml: Updated all 20 profiles to use correct
   llama-server flag names: n_ctx→ctx-size, n_gpu_layers→n-gpu-layers,
   top_p→top-p, top_k→top-k, min_p→min-p. All flags now use hyphens,
   matching what llama-server actually accepts.
2026-06-16 20:54:32 +00:00
root
1551c281c2 fix: move llama-server stderr log from /tmp to working dir (ReadWritePaths compat)
The sidecar systemd service has ProtectSystem=strict and
ReadWritePaths=/home/bigt/AI/llm, making /tmp read-only. Writing
/tmp/llama-server-stderr.log failed with EROFS.

Changed LLAMA_STDERR_LOG to os.path.join(dirname(MANIFEST_PATH), ...),
resolving to /home/bigt/AI/llm/llama-server-stderr.log, which is
within the allowed ReadWritePaths.
2026-06-16 20:36:10 +00:00
root
37fee5341e fix: capture llama-server stderr, fix YAML boolean flag conversion, reduce polling timeout
Four fixes for the model-not-loading bug:

1. **YAML boolean → CLI flag bug**: YAML parses unquoted on/off/yes/no as Python
   bools, and str(True) is 'True', which is invalid for llama.cpp's --flash-attn
   flag (it expects 'on'/'off'/'auto'). Added a _flag_value() converter that maps
   bools to 'on'/'off' strings.

2. **llama-server stderr was DEVNULL**: All error messages (bad model path,
   OOM, invalid flag) were invisible. Now captured to /tmp/llama-server-stderr.log
   and dumped to the sidecar log on failure.

3. **Reduce polling timeout**: 240 retries × 0.5s = 120s hang. Reduced to
   60 retries × 0.5s = 30s. Still dumps stderr + exit code on failure.

4. **Manifest VRAM fix**: gemma4-26b-compact-long-128k used q8_0 KV cache at
   128K context (~24GB on 24GB RTX 3090 — borderline OOM). Changed to q4_0
   (~18GB, comfortable).
2026-06-16 00:06:45 +00:00
root
903f06c634 feat: add sync_models.py script to auto-update Hermes custom_providers from router model list 2026-06-15 21:10:36 +00:00
root
95c87a764b fix: remove non-existent models from manifest (qwen-3-8b, llama-4-maverick), add 3 newly discovered models 2026-06-15 16:38:17 +00:00
root
36abbf573e fix: unbuffer sidecar stdout so logs appear in journalctl 2026-06-15 16:25:58 +00:00
1e9305395e Fixed llama-server path 2026-06-15 17:01:53 +01:00
root
7e86a30bd8 fix: resolve port conflict between sidecar and llama-server
Sidecar and llama-server were both configured on port 8080, causing
llama-server to fail on startup (port already in use).

- sidecar/app.py: LLAMA_SERVER_PORT → 8081 (sidecar stays on 8080)
- docker-compose.yml: MAIN_PC_URL → port 8081 (router sends chat
  requests to llama-server, not the sidecar)
2026-06-15 15:31:31 +00:00
root
2c23faa4a1 fix: add probe endpoints and no-model fallback for Hermes Desktop compatibility
Hermes Desktop sends probe requests to validate providers before allowing
model switching. The router was returning 503 for all of these because
the catch-all proxy requires a 'model' field in the request body.

Added explicit handlers for:
- GET /v1/models/{model_id} — OpenAI single-model lookup
- GET /api/tags — Ollama model list discovery
- POST /api/show — Ollama model info
- GET /api/v1/models — Ollama-compatible model list
- GET /v1/props, GET /props — llama.cpp server properties
- GET /version — llama.cpp version

Also fixed the catch-all proxy to route requests with no model body to
the currently active backend instead of returning 503.
2026-06-15 15:22:15 +00:00
af12370632 changed llama-server location 2026-06-15 16:10:49 +01:00
root
1ef8a497f6 fix: update docker-compose.yml SIDECAR_URL to port 8080 2026-06-15 13:23:09 +00:00
root
45417068ae fix: change sidecar port from 8081 to 8080
The sidecar is deployed on port 8080 instead of 8081. Update all:
- Default SIDECAR_PORT in sidecar/app.py
- Default SIDECAR_URL in main.py (router)
- deploy/llm-sidecar.service Environment
- deploy/README.md (.env example + config table)
- All 7 test files (conftest, circuit-breaker, fallback, queue,
  model-detection, sse-progress, v1-models)
2026-06-15 13:17:31 +00:00
b7079fa199 fixed port and conflict 2026-06-15 14:07:18 +01:00
root
e14d2c62da fix: use venv for sidecar deps, add missing deploy steps
- llm-sidecar.service: use /home/bigt/AI/llm/venv/bin/uvicorn instead of
  global python3 -m uvicorn (avoids 'No module named uvicorn' error)
- deploy/README.md: add steps to copy sidecar/ package, create venv,
  and pip install requirements.txt
2026-06-15 13:02:34 +00:00
555a887b4e fixed port 2026-06-15 13:43:43 +01:00
39a8f09232 Merge pull request 'feat: add 15 model profiles to manifest.yaml' (#18) from feature/add-model-profiles into master
Reviewed-on: https://ghituai.chiabur.xyz/doru/intelligence-router/pulls/18
2026-06-15 15:40:48 +03:00
root
e9790c00dc feat: add 15 model profiles to manifest.yaml
- Qwen3.6-27B: 3 profiles (balanced/thinking/extended)
- Gemma 4 12B: 4 profiles (Q6_K_XL and IQ4_XS variants)
- Gemma 4 26B-A4B: 3 profiles (Q4_K_M and IQ4_XS)
- Qwen3.6-35B-A3B: 3 profiles (fast/thinking/extended, non-MTP)
- Uncensored: 3 profiles (HauhauCS, Genesis APEX)
- Add pytest.ini for test discovery
- All profiles use KV cache quantization (q8_0/q4_0) for 64K-128K context
- Embedded sampling parameters per model family
- Based on research from r/LocalLLaMA, Unsloth benchmarks, HF model cards
2026-06-15 12:34:46 +00:00
root
4914363089 Epic: Model Switching via Sidecar — Issues #4-#7 + #8 deployment
Issue #4: Automatic model detection and switch
- Router extracts model from chat body, queries sidecar, triggers switch on mismatch
- Matching active model routes directly to Main PC
- No active model triggers cold start switch
- Tests: 4 in test_router_model_detection.py

Issue #5: SSE switch progress feedback
- _sse_format() correctly serializes SSE events
- sse_progress_stream() generates phase progression events
- Proxy yields SSE events then actual response
- Tests: 3 in test_router_sse_progress.py

Issue #6: Circuit breaker + OpenRouter fallback
- Circuit tracks Sidecar failures, opens after MAX_RECOVERY_ATTEMPTS (3)
- OpenRouter API key from env, no longer uses x-intelligence-level header
- Fixes: OPENROUTER_BASE, SSE format, circuit state isolation
- Tests: 7 in test_router_circuit_breaker.py

Issue #7: LXC fallback chain completion
- Full fallback: Main PC → OpenRouter → LXC
- Each backend health-checked via /v1/models before routing
- All backends down → 503 response
- Fixed: execute() wrapped in try/except to trigger fallback chain
- Tests: 3 in test_router_fallback_lxc.py

Issue #8: Systemd service deployment
- deploy/llm-sidecar.service: systemd unit with Restart=always
- deploy/manifest.yaml: example manifest with 3 profiles
- deploy/README.md: deployment instructions
- Updated: docker-compose.yml, requirements.txt, Dockerfile

Test framework improvements:
- tests/conftest.py: shared URL patches for all router tests
- Fixed global state pollution in circuit breaker tests
- Fixed sidecar switch test (AsyncMock for async function)

Total: 42 tests passing
2026-06-15 01:13:36 +00:00
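
The fallback chain from Issue #7 can be sketched as an ordered list of health-checked backends. The names `pick_backend` and `status_for` and the callable-based health checks are assumptions. In the router, each check would be a request to the backend's /v1/models endpoint.

```python
from typing import Callable, List, Optional, Tuple

Backend = Tuple[str, Callable[[], bool]]


def pick_backend(chain: List[Backend]) -> Optional[str]:
    """Return the first backend whose health check passes, else None."""
    for name, is_healthy in chain:
        try:
            if is_healthy():
                return name
        except Exception:
            # A failing probe counts as "down" so the chain keeps going.
            continue
    return None


def status_for(chain: List[Backend]) -> int:
    """Map the routing decision to an HTTP status: 503 when every backend is down."""
    return 200 if pick_backend(chain) is not None else 503
```

Catching exceptions inside the loop mirrors the fix noted above: an error on one backend moves on to the next one instead of aborting the request.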
root
c491779248 Epic: Model Switching via Sidecar — Issues #2-#3
Issue #2: Manifest schema + Sidecar foundation
- sidecar/manifest.py: YAML manifest loading and profile validation
- sidecar/app.py: FastAPI sidecar service with /models/available, /models/status endpoints
- Router GET /v1/models: proxies to sidecar, returns OpenAI-compatible model list
- Tests: 12 manifest tests, 6 sidecar endpoint tests, 3 router tests (21 total)

Issue #3: Sidecar model switch + Router request queue
- Sidecar POST /models/switch: stops current llama-server, starts new one, polls for readiness
- Switch lock prevents concurrent switches (threading.Lock for TestClient compatibility)
- Router request queue: max 10 requests, 120s hard timeout, 429 when full
- Router automatic model detection: extracts model from chat body, matches against sidecar status
- Full proxy endpoint with Sidecar → Main PC routing and fallback chain
- Tests: 5 sidecar switch tests, 4 queue tests, 3 router integration tests (12 total)

Total: 33 tests, all passing
2026-06-15 00:49:24 +00:00
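
A minimal sketch of the bounded request queue from Issue #3. The limits (10 requests, 120s, 429 when full) come from the notes above. The class name `SwitchQueue`, its methods, and the 504 returned on timeout are assumptions, since the notes do not say which status the hard timeout produces.

```python
import asyncio

MAX_QUEUE = 10         # from the Issue #3 notes
QUEUE_TIMEOUT = 120.0  # seconds, from the Issue #3 notes


class SwitchQueue:
    """Hold requests while a model switch is in progress."""

    def __init__(self, max_size: int = MAX_QUEUE) -> None:
        self._max = max_size
        self._waiting = 0
        self._done = asyncio.Event()

    async def wait_for_switch(self, timeout: float = QUEUE_TIMEOUT) -> int:
        """Return an HTTP-style status for the queued request."""
        if self._waiting >= self._max:
            return 429  # queue full
        self._waiting += 1
        try:
            await asyncio.wait_for(self._done.wait(), timeout)
            return 200
        except asyncio.TimeoutError:
            return 504  # assumed status for the hard timeout
        finally:
            self._waiting -= 1

    def drain(self) -> None:
        """Release every queued request once the switch has finished."""
        self._done.set()


async def demo() -> list:
    queue = SwitchQueue(max_size=2)
    waiters = [asyncio.create_task(queue.wait_for_switch(timeout=1.0)) for _ in range(3)]
    await asyncio.sleep(0)  # let all three tasks reach the queue
    queue.drain()
    return await asyncio.gather(*waiters)
```

In `demo()`, the queue holds two requests, so the third is rejected while the first two are released by `drain()`.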
root
b2031d8b7a Added next changes 2026-06-15 00:09:31 +00:00
712fe041b1 test 2026-06-09 19:54:03 +01:00
1a7dd550ec added debug 2026-06-09 18:05:10 +01:00
d7090b1644 Fix build context, port conflict, and improve proxy/health-check logic 2026-06-09 17:34:07 +01:00
cb01b42f38 Cleanup: Remove redundant llama-slm service and use LXC IP 2026-06-09 12:41:32 +01:00
4ea94f7d60 Update IPs for Main PC and LXC Fallback Brain 2026-06-09 12:37:34 +01:00
Chiabur Aiode
8fab2f3801 .env 2026-06-09 13:57:22 +03:00
Chiabur Aiode
038e8f9f7c gitignore 2026-06-09 13:54:18 +03:00
0e05390be2 Initial commit: migrate intelligence-router files 2026-06-09 11:48:43 +01:00