feat: 로컬 LLM 채팅 기능 추가 (Ollama + vLLM, 스트리밍, MCP 도구 호출)
- OllamaController: Ollama/vLLM 프록시 API (채팅, 스트리밍, 모델 목록, 설정) - UI: 새 대화 탭, 세션 관리, Markdown 렌더링, 스트리밍 응답 - vLLM: OpenAI-compatible API 지원, MCP function calling 통합 - Fix: McpClient DI 팩토리 등록 (HttpClient BaseAddress 문제 해결) - Fix: llm-model.json 직렬화 JsonSerializer 사용 - Fix: nl2sql_worker KST 시간대 표시 (AT TIME ZONE Asia/Seoul) - Program.cs: Ollama/vLLM HttpClient 등록 (1800s timeout)
This commit is contained in:
28
gemma4-dflash-run.sh
Normal file
28
gemma4-dflash-run.sh
Normal file
@@ -0,0 +1,28 @@
|
||||
docker run -d --name gemma4-dflash \
|
||||
--gpus all --network host --ipc host \
|
||||
--ulimit memlock=-1 --ulimit stack=67108864 \
|
||||
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
|
||||
-e PORT=8000 \
|
||||
-e SERVED_MODEL_NAME=google/gemma-4-31B-it-vllm-fp8-dflash-16k \
|
||||
-e HF_TOKEN=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 \
|
||||
-e VLLM_DISABLE_COMPILE_CACHE=1 \
|
||||
--entrypoint "" \
|
||||
gemma4-dflash:arm64 \
|
||||
bash -c "
|
||||
python3 /opt/gemma4-dflash-spark-vllm/scripts/patch_vllm_gb10_gemma4_dflash_runtime.py
|
||||
exec vllm serve google/gemma-4-31B-it \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--served-model-name google/gemma-4-31B-it-vllm-fp8-dflash-16k \
|
||||
--trust-remote-code \
|
||||
--max-model-len 16384 \
|
||||
--gpu-memory-utilization 0.80 \
|
||||
--quantization fp8 \
|
||||
--tensor-parallel-size 1 \
|
||||
--max-num-batched-tokens 16384 \
|
||||
--enforce-eager \
|
||||
--speculative-config '{"model":"RedHatAI/gemma-4-31B-it-speculator.dflash","num_speculative_tokens":8,"method":"dflash"}' \
|
||||
--limit-mm-per-prompt '{"image":0,"video":0}' \
|
||||
--enable-auto-tool-choice \
|
||||
--tool-call-parser hermes
|
||||
"
|
||||
Reference in New Issue
Block a user