A production-ready llama-cpp-python server packaged with Docker images for multiple compute backends:
- CPU (OpenBLAS)
- CUDA (NVIDIA GPU)
- XPU (Intel SYCL, experimental)
- ROCm (AMD GPU, experimental)
- Vulkan (cross-vendor GPU, experimental)
- OpenCL (cross-vendor GPU, experimental)
This repository includes:
- prebuilt-oriented Dockerfiles for each backend;
- multi-model runtime configurations (`config-cpu.json`, `config-cuda.json`, `config-xpu.json`, `config-rocm.json`, `config-vulkan.json`, `config-opencl.json`);
- a configuration-structure validation test suite.
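The configuration files follow the llama-cpp-python multi-model server format; a minimal sketch (the path, alias, and values below are illustrative placeholders, not this repository's actual entries) might look like:

```json
{
  "host": "0.0.0.0",
  "port": 8008,
  "models": [
    {
      "model": "/models/example-model.Q4_K_M.gguf",
      "model_alias": "example",
      "n_ctx": 4096,
      "n_gpu_layers": 0
    }
  ]
}
```

Each entry in `models` typically maps a GGUF file under the mounted `/models` volume to an alias that clients select via the `model` field of their requests.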
Prerequisites:
- Docker Desktop (Windows/macOS) or Docker Engine (Linux)
- Git
- Python 3.10+ (for local tests)
Quick install on Windows (PowerShell/CMD as Administrator):

```powershell
choco install git jq curl -y
```

Quick install on Debian/Ubuntu:

```bash
sudo apt update
sudo apt install -y wget jq git
```

Clone the repository:

```bash
git clone https://github.com/Smartappli/LLM_SERVER.git
cd LLM_SERVER
```

Run the volume-creation script from the `Docker/` directory:
Windows:

```bat
cd Docker
create_docker_volume.bat
```

Linux/macOS:

```bash
cd Docker
chmod +x create_docker_volume.sh
./create_docker_volume.sh
```

Run the following build commands from the `Docker/` directory.
```bash
cd cpu
docker build -t smartappli/llama-cpp-python-server-cpu:1.0 -f cpu.Dockerfile ..

cd ../cuda
docker build -t smartappli/llama-cpp-python-server-cuda:1.0 -f cuda.Dockerfile ..

cd ../xpu
docker build -t smartappli/llama-cpp-python-server-xpu:1.0 -f xpu.Dockerfile ..

cd ../rocm
docker build -t smartappli/llama-cpp-python-server-rocm:1.0 -f rocm.Dockerfile ..

cd ../vulkan
docker build -t smartappli/llama-cpp-python-server-vulkan:1.0 -f vulkan.Dockerfile ..

cd ../opencl
docker build -t smartappli/llama-cpp-python-server-opencl:1.0 -f opencl.Dockerfile ..
```

All containers mount the `LLM_SERVER` Docker volume at `/models`.
```bash
# CPU
docker run --rm -p 8008:8008 -v LLM_SERVER:/models smartappli/llama-cpp-python-server-cpu:1.0

# CUDA
docker run --rm --gpus all -p 8008:8008 -v LLM_SERVER:/models smartappli/llama-cpp-python-server-cuda:1.0

# XPU
docker run --rm -p 8008:8008 -v LLM_SERVER:/models smartappli/llama-cpp-python-server-xpu:1.0

# ROCm
docker run --rm -p 8008:8008 -v LLM_SERVER:/models smartappli/llama-cpp-python-server-rocm:1.0

# Vulkan
docker run --rm -p 8008:8008 -v LLM_SERVER:/models smartappli/llama-cpp-python-server-vulkan:1.0

# OpenCL
docker run --rm -p 8008:8008 -v LLM_SERVER:/models smartappli/llama-cpp-python-server-opencl:1.0
```

OpenAI-compatible endpoint: `http://localhost:8008/v1`
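Any OpenAI-compatible client can talk to this endpoint. A minimal standard-library sketch (the `"default"` model alias is a placeholder; use an alias defined in your config file):

```python
import json
import urllib.request

# Build an OpenAI-style chat-completion payload.
# "default" is a placeholder alias; use one defined in your config-*.json.
payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32,
}

req = urllib.request.Request(
    "http://localhost:8008/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Requires a running server; uncomment to actually send the request.
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```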
Run the test suite from the repository root:

```bash
python -m unittest -v tests/test_configs.py
```

`Docker/main.py` sends a request to the local server:

```bash
python Docker/main.py
```

Ensure the server is already running at `http://localhost:8008/v1`.
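The kind of structural check the test suite performs can be illustrated with a simplified sketch (the field names below assume the llama-cpp-python config layout; the real suite in `tests/test_configs.py` may check more):

```python
import unittest


class ConfigStructureSketch(unittest.TestCase):
    """Simplified illustration of config-structure checks (not the real suite)."""

    # Inline stand-in for a loaded config-*.json file.
    SAMPLE = {
        "host": "0.0.0.0",
        "port": 8008,
        "models": [{"model": "/models/example.gguf", "model_alias": "example"}],
    }

    def test_has_models_list(self):
        # The config must expose a non-empty "models" list.
        self.assertIsInstance(self.SAMPLE.get("models"), list)
        self.assertTrue(self.SAMPLE["models"])

    def test_each_model_has_path(self):
        # Every model entry needs at least a "model" path.
        for entry in self.SAMPLE["models"]:
            self.assertIn("model", entry)
```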
Install the Python dependencies first:

```bash
pip install -r requirements.txt
```

Troubleshooting:
- Port already in use: change the host-side port in `-p 8008:8008` (for example `-p 8010:8008`).
- NVIDIA GPU not detected: verify the Docker setup and the NVIDIA Container Toolkit, then test with `docker run --gpus all ...`.
- Intel XPU not detected: verify Intel GPU drivers on the host and device access in Docker.
- ROCm not detected: verify AMD ROCm drivers on the host and device permissions (`/dev/kfd`, `/dev/dri`).
- Vulkan not detected: verify the host Vulkan stack (`vulkaninfo`) and GPU device access.
- OpenCL not detected: verify the vendor OpenCL runtime and device visibility using `clinfo`.
- Models not found: verify the `LLM_SERVER` volume contains the expected model files.
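For the port-in-use case, a quick check can be done from Python; this is a hedged helper sketch, not part of this repository:

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising.
        return s.connect_ex((host, port)) == 0


# Example: if port_in_use(8008) is True, remap the host port, e.g. -p 8010:8008.
```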
Docker/cpu/cpu.Dockerfile: CPU imageDocker/cuda/cuda.Dockerfile: CUDA imageDocker/cpu/config-cpu.json: CPU multi-model configurationDocker/cuda/config-cuda.json: CUDA multi-model configurationDocker/xpu/xpu.Dockerfile: Intel XPU image (experimental)Docker/xpu/config-xpu.json: XPU multi-model configurationDocker/rocm/rocm.Dockerfile: AMD ROCm image (experimental)Docker/rocm/config-rocm.json: ROCm multi-model configurationDocker/vulkan/vulkan.Dockerfile: Vulkan image (experimental)Docker/vulkan/config-vulkan.json: Vulkan multi-model configurationDocker/opencl/opencl.Dockerfile: OpenCL image (experimental)Docker/opencl/config-opencl.json: OpenCL multi-model configurationtests/test_configs.py: configuration unit testsDocker/main.py: request smoke-test script
A helper script is available to discover medical GGUF models from Hugging Face and optionally download them:
```bash
python Docker/download_medical_models.py --output-dir models
```

Download selected files (preferred quantization per model):

```bash
python Docker/download_medical_models.py --download --output-dir models
```

Download all GGUF files for each discovered medical model:

```bash
python Docker/download_medical_models.py --download --all-files --output-dir models
```

You can pass a Hugging Face token with `--token <HF_TOKEN>` or the `HF_TOKEN` environment variable for gated/private models.
A web interface is available to discover and download medical GGUF models from Hugging Face:
```bash
python -m pip install -r requirements.txt
uv run --with-requirements requirements.txt granian --interface asgi --host 0.0.0.0 --port 8010 medical_ui.asgi:application --app-dir medical_ui
```

Open `http://localhost:8010/` (served by ASGI/granian).
From the form you can:
- set medical keywords;
- choose search limit;
- enable/disable downloads;
- choose one preferred GGUF file or all files;
- provide a Hugging Face token for gated repositories.
Security-oriented Django settings are environment-driven:
```bash
export DJANGO_ENV=prod
export DJANGO_DEBUG=false
export DJANGO_SECRET_KEY="replace-with-a-long-random-secret"
export DJANGO_ALLOWED_HOSTS="your-domain.com,api.your-domain.com"
```

For local development, defaults remain developer-friendly (`DJANGO_ENV=dev`).
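The environment-driven pattern can be sketched as below; the variable handling is illustrative of the technique, not a copy of the project's actual `settings.py`:

```python
import os

# Illustrative sketch of environment-driven Django settings; the project's
# real settings module may apply different defaults and validation.
ENV = os.environ.get("DJANGO_ENV", "dev")
DEBUG = os.environ.get(
    "DJANGO_DEBUG", "true" if ENV == "dev" else "false"
).lower() == "true"
SECRET_KEY = os.environ.get(
    "DJANGO_SECRET_KEY", "dev-only-insecure-key" if ENV == "dev" else ""
)
ALLOWED_HOSTS = [h for h in os.environ.get("DJANGO_ALLOWED_HOSTS", "").split(",") if h]

# Fail fast in production rather than running with an empty secret.
if ENV == "prod" and not SECRET_KEY:
    raise RuntimeError("DJANGO_SECRET_KEY must be set when DJANGO_ENV=prod")
```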
UI safety notes:
- CLI and Django both reuse shared business logic from `services/medical_models.py`;
- downloads/discovery are processed in background jobs (non-blocking HTTP request cycle);
- each job has a status (`queued`, `running`, `done`, `failed`) visible on the page;
- `output_dir` is sanitized server-side and restricted under `<repo>/model_downloads`.
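The `output_dir` restriction can be sketched with `pathlib`; this is a hedged illustration of the path-containment technique, not the repository's actual code:

```python
from pathlib import Path


def sanitize_output_dir(repo_root: Path, user_value: str) -> Path:
    """Resolve user input and reject anything escaping <repo>/model_downloads."""
    base = (repo_root / "model_downloads").resolve()
    candidate = (base / user_value).resolve()
    # is_relative_to (Python 3.9+) rejects ../ traversal after resolution.
    if not candidate.is_relative_to(base):
        raise ValueError(f"output_dir escapes {base}")
    return candidate
```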