ExperionCrawler/gemma4-dflash-run.sh
windpacer 35136ba91e feat: add local LLM chat feature (Ollama + vLLM, streaming, MCP tool calls)
- OllamaController: Ollama/vLLM proxy API (chat, streaming, model list, settings)
- UI: new conversation tab, session management, Markdown rendering, streaming responses
- vLLM: OpenAI-compatible API support, MCP function calling integration
- Fix: register McpClient through a DI factory (resolves the HttpClient BaseAddress issue)
- Fix: use JsonSerializer to serialize llm-model.json
- Fix: show KST time zone in nl2sql_worker (AT TIME ZONE Asia/Seoul)
- Program.cs: register Ollama/vLLM HttpClient (1800s timeout)
2026-05-12 19:59:31 +09:00


docker run -d --name gemma4-dflash \
--gpus all --network host --ipc host \
--ulimit memlock=-1 --ulimit stack=67108864 \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
-e PORT=8000 \
-e SERVED_MODEL_NAME=google/gemma-4-31B-it-vllm-fp8-dflash-16k \
-e HF_TOKEN \
-e VLLM_DISABLE_COMPILE_CACHE=1 \
--entrypoint "" \
gemma4-dflash:arm64 \
bash -c "
set -e
python3 /opt/gemma4-dflash-spark-vllm/scripts/patch_vllm_gb10_gemma4_dflash_runtime.py
exec vllm serve google/gemma-4-31B-it \
--host 0.0.0.0 \
--port 8000 \
--served-model-name google/gemma-4-31B-it-vllm-fp8-dflash-16k \
--trust-remote-code \
--max-model-len 16384 \
--gpu-memory-utilization 0.80 \
--quantization fp8 \
--tensor-parallel-size 1 \
--max-num-batched-tokens 16384 \
--enforce-eager \
--speculative-config '{\"model\":\"RedHatAI/gemma-4-31B-it-speculator.dflash\",\"num_speculative_tokens\":8,\"method\":\"dflash\"}' \
--limit-mm-per-prompt '{\"image\":0,\"video\":0}' \
--enable-auto-tool-choice \
--tool-call-parser hermes
"