Commit Graph

8 Commits

Author SHA1 Message Date
root
1551c281c2 fix: move llama-server stderr log from /tmp to working dir (ReadWritePaths compat)
The sidecar systemd service has ProtectSystem=strict and
ReadWritePaths=/home/bigt/AI/llm, making /tmp read-only. Writing
/tmp/llama-server-stderr.log failed with EROFS.

Changed LLAMA_STDERR_LOG to os.path.join(dirname(MANIFEST_PATH), ...),
resolving to /home/bigt/AI/llm/llama-server-stderr.log, which is
within the allowed ReadWritePaths.
2026-06-16 20:36:10 +00:00
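A minimal sketch of the path change described in 1551c281c2, assuming the manifest sits directly in /home/bigt/AI/llm. The manifest file name and the helper `stderr_log_path` are hypothetical; only the resulting log path comes from the commit message.

```python
import os

# Hypothetical manifest location; the commit states only the directory
# and the resulting log path.
MANIFEST_PATH = "/home/bigt/AI/llm/manifest.yaml"


def stderr_log_path(manifest_path):
    """Place the llama-server stderr log next to the manifest file."""
    return os.path.join(os.path.dirname(manifest_path), "llama-server-stderr.log")


# Resolves inside the directory that ReadWritePaths leaves writable.
LLAMA_STDERR_LOG = stderr_log_path(MANIFEST_PATH)
```

Deriving the log location from the manifest path keeps the file inside the one directory the unit is allowed to write to, without adding another setting.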
root
37fee5341e fix: capture llama-server stderr, fix YAML boolean flag conversion, reduce polling timeout
Four fixes for the model-not-loading bug:

1. **YAML boolean → CLI flag bug**: YAML parses unquoted 'on'/'off'/'yes'/'no' as
   Python bools, and str(True) yields 'True', which is INVALID for llama.cpp's
   --flash-attn flag (expects 'on'/'off'/'auto'). Added _flag_value() converter
   that maps bools to 'on'/'off' strings.

2. **llama-server stderr was DEVNULL**: All error messages (bad model path,
   OOM, invalid flag) were invisible. Now captured to /tmp/llama-server-stderr.log
   and dumped to the sidecar log on failure.

3. **Reduce polling timeout**: 240 retries × 0.5s = 120s hang. Reduced to
   60 retries × 0.5s = 30s. Still dumps stderr + exit code on failure.

4. **Manifest VRAM fix**: gemma4-26b-compact-long-128k used q8_0 KV cache at
   128K context (~24GB on 24GB RTX 3090 — borderline OOM). Changed to q4_0
   (~18GB, comfortable).
2026-06-16 00:06:45 +00:00
root
36abbf573e fix: unbuffer sidecar stdout so logs appear in journalctl
2026-06-15 16:25:58 +00:00
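The commit message for 36abbf573e does not say how stdout was unbuffered. One possible approach, shown here purely as an assumption, is to switch the text stream to line buffering so each log line is flushed as soon as it is written. The helper name `enable_line_buffering` is hypothetical.

```python
import io


def enable_line_buffering(stream):
    """Flush the text stream after every newline (Python 3.7+)."""
    stream.reconfigure(line_buffering=True)
    return stream


# Demonstrated on an in-memory text stream; in a service the target
# would be sys.stdout.
demo = enable_line_buffering(io.TextIOWrapper(io.BytesIO(), newline="\n"))
demo.write("sidecar started\n")
```

Running the interpreter with `python -u` or setting `PYTHONUNBUFFERED=1` in the unit's environment are alternatives that need no code change.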
1e9305395e Fixed llama-server path
2026-06-15 17:01:53 +01:00
root
7e86a30bd8 fix: resolve port conflict between sidecar and llama-server
Sidecar and llama-server were both configured on port 8080, causing
llama-server to fail on startup (port already in use).

- sidecar/app.py: LLAMA_SERVER_PORT → 8081 (sidecar stays on 8080)
- docker-compose.yml: MAIN_PC_URL → port 8081 (router sends chat
  requests to llama-server, not the sidecar)
2026-06-15 15:31:31 +00:00
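A conflict like the one fixed in 7e86a30bd8 can be caught at startup with a small guard. The sketch below assumes both ports are read from environment variables with the defaults named in the commit; whether sidecar/app.py reads them this way is an assumption, and `check_ports` is a hypothetical helper.

```python
import os

# Defaults taken from the commit message; reading them from the
# environment is an assumption.
SIDECAR_PORT = int(os.environ.get("SIDECAR_PORT", "8080"))
LLAMA_SERVER_PORT = int(os.environ.get("LLAMA_SERVER_PORT", "8081"))


def check_ports(sidecar_port, llama_port):
    """Fail fast if both services would bind the same port."""
    if sidecar_port == llama_port:
        raise ValueError(
            f"sidecar and llama-server both configured on port {sidecar_port}"
        )
```

Calling `check_ports(SIDECAR_PORT, LLAMA_SERVER_PORT)` before launching llama-server turns a silent startup failure into an explicit configuration error.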
af12370632 changed llama-server location
2026-06-15 16:10:49 +01:00
root
45417068ae fix: change sidecar port from 8081 to 8080
The sidecar is deployed on port 8080 instead of 8081. Updated every reference:
- Default SIDECAR_PORT in sidecar/app.py
- Default SIDECAR_URL in main.py (router)
- deploy/llm-sidecar.service Environment
- deploy/README.md (.env example + config table)
- All 7 test files (conftest, circuit-breaker, fallback, queue,
  model-detection, sse-progress, v1-models)
2026-06-15 13:17:31 +00:00
root
c491779248 Epic: Model Switching via Sidecar — Issues #2-#3
Issue #2: Manifest schema + Sidecar foundation
- sidecar/manifest.py: YAML manifest loading and profile validation
- sidecar/app.py: FastAPI sidecar service with /models/available, /models/status endpoints
- Router GET /v1/models: proxies to sidecar, returns OpenAI-compatible model list
- Tests: 12 manifest tests, 6 sidecar endpoint tests, 3 router tests (21 total)

Issue #3: Sidecar model switch + Router request queue
- Sidecar POST /models/switch: stops current llama-server, starts new one, polls for readiness
- Switch lock prevents concurrent switches (threading.Lock for TestClient compatibility)
- Router request queue: max 10 requests, 120s hard timeout, 429 when full
- Router automatic model detection: extracts model from chat body, matches against sidecar status
- Full proxy endpoint with Sidecar → Main PC routing and fallback chain
- Tests: 5 sidecar switch tests, 4 queue tests, 3 router integration tests (12 total)

Total: 33 tests, all passing
2026-06-15 00:49:24 +00:00
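The switch lock and bounded request queue from Issue #3 in c491779248 can be sketched without any web framework. The names `SwitchBusy`, `switch_model`, and `RequestSlots` are hypothetical, and rejecting a second switch immediately (rather than waiting for the lock) is an assumption. Only the use of threading.Lock, the limit of 10 requests, and the 429 response come from the commit message.

```python
import threading


class SwitchBusy(Exception):
    """Raised when a model switch is already in progress."""


_switch_lock = threading.Lock()


def switch_model(profile, do_switch):
    """Run do_switch(profile) unless another switch holds the lock."""
    if not _switch_lock.acquire(blocking=False):
        raise SwitchBusy("a model switch is already running")
    try:
        return do_switch(profile)
    finally:
        _switch_lock.release()


class RequestSlots:
    """Bounded admission counter; a full queue maps to HTTP 429."""

    def __init__(self, max_size=10):
        self._max = max_size
        self._used = 0
        self._guard = threading.Lock()

    def try_enter(self):
        with self._guard:
            if self._used >= self._max:
                return False  # caller would answer 429 here
            self._used += 1
            return True

    def leave(self):
        with self._guard:
            self._used = max(0, self._used - 1)
```

A handler would call `try_enter()` before queuing a request and `leave()` in a `finally` block once the request completes or hits its hard timeout.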