fastapi-gemma-translate

This project provides a robust REST API built with FastAPI and Docker for managing and interacting with Google's Gemma Translate models, enabling on-device string translation.

Key Features

  • Text and image translation services
  • Support for multiple Gemma Translate model sizes (4B, 12B, 27B) and GGUF quantizations
  • Automatic API Docs: Interactive API documentation powered by Swagger UI and ReDoc.

Technology Stack

  • FastAPI for the core web framework.
  • Uvicorn as the ASGI server.
  • Docker for containerization and easy deployment.
  • Pydantic for data validation and settings management.

Getting Started

Prerequisites

1. Set Up the Python Environment

Create and activate a Conda environment:

conda create -n translate python=3.11
conda activate translate

Install the Python dependencies, including huggingface_hub, which provides the hf CLI used to download the models:

pip install "fastapi[standard]" "uvicorn[standard]" httpx llama-cpp-python huggingface_hub

We're going to fetch GGUF model files from these repositories:

https://huggingface.co/mradermacher/translategemma-27b-it-GGUF
https://huggingface.co/mradermacher/translategemma-12b-it-GGUF
https://huggingface.co/mradermacher/translategemma-4b-it-GGUF

Download one of the following Gemma Translate models:

Gemma 4B:

hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.IQ4_XS.gguf --local-dir app/models/translategemma-4b-it.IQ4_XS
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q2_K.gguf --local-dir app/models/translategemma-4b-it.Q2_K
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q3_K_L.gguf --local-dir app/models/translategemma-4b-it.Q3_K_L
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q3_K_M.gguf --local-dir app/models/translategemma-4b-it.Q3_K_M
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q3_K_S.gguf --local-dir app/models/translategemma-4b-it.Q3_K_S
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q4_K_M.gguf --local-dir app/models/translategemma-4b-it.Q4_K_M
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q4_K_S.gguf --local-dir app/models/translategemma-4b-it.Q4_K_S
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q5_K_M.gguf --local-dir app/models/translategemma-4b-it.Q5_K_M
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q5_K_S.gguf --local-dir app/models/translategemma-4b-it.Q5_K_S
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q6_K.gguf --local-dir app/models/translategemma-4b-it.Q6_K
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q8_0.gguf --local-dir app/models/translategemma-4b-it.Q8_0

Gemma 12B:

hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.IQ4_XS.gguf --local-dir app/models/translategemma-12b-it.IQ4_XS
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q2_K.gguf --local-dir app/models/translategemma-12b-it.Q2_K
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q3_K_L.gguf --local-dir app/models/translategemma-12b-it.Q3_K_L
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q3_K_M.gguf --local-dir app/models/translategemma-12b-it.Q3_K_M
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q3_K_S.gguf --local-dir app/models/translategemma-12b-it.Q3_K_S
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q4_K_M.gguf --local-dir app/models/translategemma-12b-it.Q4_K_M
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q4_K_S.gguf --local-dir app/models/translategemma-12b-it.Q4_K_S
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q5_K_M.gguf --local-dir app/models/translategemma-12b-it.Q5_K_M
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q5_K_S.gguf --local-dir app/models/translategemma-12b-it.Q5_K_S
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q6_K.gguf --local-dir app/models/translategemma-12b-it.Q6_K
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q8_0.gguf --local-dir app/models/translategemma-12b-it.Q8_0

Gemma 27B:

hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.IQ4_XS.gguf --local-dir app/models/translategemma-27b-it.IQ4_XS
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q2_K.gguf --local-dir app/models/translategemma-27b-it.Q2_K
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q3_K_L.gguf --local-dir app/models/translategemma-27b-it.Q3_K_L
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q3_K_M.gguf --local-dir app/models/translategemma-27b-it.Q3_K_M
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q3_K_S.gguf --local-dir app/models/translategemma-27b-it.Q3_K_S
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q4_K_M.gguf --local-dir app/models/translategemma-27b-it.Q4_K_M
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q4_K_S.gguf --local-dir app/models/translategemma-27b-it.Q4_K_S
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q5_K_M.gguf --local-dir app/models/translategemma-27b-it.Q5_K_M
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q5_K_S.gguf --local-dir app/models/translategemma-27b-it.Q5_K_S
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q6_K.gguf --local-dir app/models/translategemma-27b-it.Q6_K
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q8_0.gguf --local-dir app/models/translategemma-27b-it.Q8_0
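
If you plan to use image translation, you'll also need the vision projector (mmproj) file referenced in the Model Lifecycle section below. Assuming the same repository hosts it, the download follows the same pattern (shown here for the 4B Q8_0 folder):

hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.mmproj-f16.gguf --local-dir app/models/translategemma-4b-it.Q8_0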

Running the Application

Using Docker (Recommended)

This is the easiest and recommended way to run the application.

  1. Build the Docker image (either tag works; the second simply adds the Docker Hub namespace):

    docker build -t fastapi_gemma_translate .
    docker build -t grctest/fastapi_gemma_translate .
  2. Run the Docker container: This command runs the container in detached mode (-d) and maps port 8080 on your host to port 8080 in the container.

    docker run -d --name ai_container_cpu -p 127.0.0.1:8080:8080 -v C:/Users/username/Desktop/git/fastapi-gemma-translate/_models:/code/models fastapi_gemma_translate

Alternatively, you can pull and run my prebuilt Docker image:

  1. Pull the Docker image:

    docker image pull grctest/fastapi_gemma_translate
  2. Run the Docker container: This command runs the container in detached mode (-d) and maps port 8080 on your host to port 8080 in the container.

    docker run -d --name ai_container_cpu -p 127.0.0.1:8080:8080 -v C:/Users/username/Desktop/git/fastapi-gemma-translate/_models:/code/models grctest/fastapi_gemma_translate

The commands above run in CPU-only mode. If you want Nvidia CUDA GPU support, you'll need one of the CUDA Dockerfiles, either by:

  1. Building the Docker image:

    docker build -t fastapi_gemma_translate_cuda:legacy -f LegacyCudaDockerfile .
    docker build -t fastapi_gemma_translate_cuda:mainstream -f MainstreamCudaDockerfile .
    docker build -t fastapi_gemma_translate_cuda:future -f FutureCudaDockerfile .
    docker build -t grctest/fastapi_gemma_translate_cuda:legacy -f LegacyCudaDockerfile .
    docker build -t grctest/fastapi_gemma_translate_cuda:mainstream -f MainstreamCudaDockerfile .
    docker build -t grctest/fastapi_gemma_translate_cuda:future -f FutureCudaDockerfile .
  2. Pulling the Docker image:

    docker image pull grctest/fastapi_gemma_translate_cuda:legacy

    In either case, run the container with the --gpus flag:

    docker run --gpus all -d --name ai_container_gpu -p 127.0.0.1:8080:8080 -v C:/Users/username/Desktop/git/fastapi-gemma-translate/_models:/code/models -e LLAMA_N_GPU_LAYERS=-1 fastapi_gemma_translate_cuda:legacy
    docker run --gpus all -d --name ai_container_cuda -p 127.0.0.1:8080:8080 -v C:/Users/username/Desktop/git/fastapi-gemma-translate/_models:/code/models -e LLAMA_N_GPU_LAYERS=-1 grctest/fastapi_gemma_translate_cuda:legacy

Note: Replace C:/Users/username/Desktop/git/fastapi-gemma-translate/_models with the path where you downloaded the GGUF folders and files.

Container GPU variants:

  • Legacy: Pascal / 10xx Nvidia cards

  • Mainstream: Turing to Ada / 20xx, 30xx, 40xx, A100 Nvidia cards

  • Future: Blackwell / 50xx Nvidia cards

Local Development

For development, you can run the application directly with Uvicorn, which enables auto-reloading.

uvicorn app.main:app --host 0.0.0.0 --port 8080 --reload

Concurrency Tuning (Recommended for CUDA stability)

llama-cpp-python GGUF inference can become unstable under heavy parallel requests on some CUDA setups. This service therefore includes a configurable inference gate shared by /translate, /experimental_translation, and /translate_image; a minimal sketch of the gating logic appears at the end of this section.

# safest default: serialize inference
# (Windows cmd shown; use `export VAR=value` on Linux/macOS)
set LLAMA_MAX_CONCURRENT_INFERENCES=1

# max wait in seconds before returning HTTP 503
set LLAMA_INFERENCE_ACQUIRE_TIMEOUT_SECONDS=45

Notes:

  • Keep LLAMA_MAX_CONCURRENT_INFERENCES=1 for 12B/27B models unless higher values are validated on your hardware.
  • If the queue wait exceeds the timeout, the API returns 503 instead of letting requests pile up indefinitely.
  • These can be passed alongside existing runtime env vars like LLAMA_N_GPU_LAYERS.

Example:

docker run --gpus all -d --name ai_container_cuda -p 127.0.0.1:8080:8080 \
    -v C:/Users/username/Desktop/git/fastapi-gemma-translate/_models:/code/models \
    -e LLAMA_N_GPU_LAYERS=-1 \
    -e LLAMA_MAX_CONCURRENT_INFERENCES=1 \
    -e LLAMA_INFERENCE_ACQUIRE_TIMEOUT_SECONDS=45 \
    grctest/fastapi_gemma_translate_cuda:legacy
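
Internally, the gate behaves like a bounded-wait semaphore around each model call. Here is a minimal sketch in Python, assuming FastAPI and asyncio; the names are illustrative, not the service's actual internals:

import asyncio
from fastapi import HTTPException

MAX_CONCURRENT = 1        # mirrors LLAMA_MAX_CONCURRENT_INFERENCES
ACQUIRE_TIMEOUT = 45.0    # mirrors LLAMA_INFERENCE_ACQUIRE_TIMEOUT_SECONDS

_gate = asyncio.Semaphore(MAX_CONCURRENT)

async def run_inference(blocking_call, *args, **kwargs):
    try:
        # Wait for a free slot; give up once the configured timeout elapses.
        await asyncio.wait_for(_gate.acquire(), timeout=ACQUIRE_TIMEOUT)
    except asyncio.TimeoutError:
        raise HTTPException(status_code=503, detail="Inference queue timed out")
    try:
        # llama-cpp-python calls block, so run them off the event loop.
        return await asyncio.to_thread(blocking_call, *args, **kwargs)
    finally:
        _gate.release()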

API Usage

Once the server is running, you can access the interactive API documentation at http://127.0.0.1:8080/docs (Swagger UI) and http://127.0.0.1:8080/redoc (ReDoc).

Model Lifecycle (Required)

Model loading is explicit. Translation endpoints will reject requests unless the requested model is already loaded.

  1. Load a model:
curl -X POST "http://127.0.0.1:8080/model/load" \
    -H "Content-Type: application/json" \
    -d '{"model":"translategemma-4b-it-Q8_0"}'

Load a model with vision support (mmproj can be relative to the model folder or absolute):

curl -X POST "http://127.0.0.1:8080/model/load" \
    -H "Content-Type: application/json" \
    -d '{"model":"translategemma-4b-it-Q8_0","mmproj":"translategemma-4b-it.mmproj-f16.gguf"}'
  2. Check model status:
curl "http://127.0.0.1:8080/model/status"
curl "http://127.0.0.1:8080/model/status?model=translategemma-4b-it-Q8_0"

/model/status includes:

  • loaded: whether any model is loaded
  • loading: whether a load is currently in progress
  • loaded_model: currently loaded model name
  • vision_enabled: whether the currently loaded model can process images
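
An illustrative response (field values are examples, not verbatim server output):

{
  "loaded": true,
  "loading": false,
  "loaded_model": "translategemma-4b-it-Q8_0",
  "vision_enabled": true
}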

Text Translation Endpoints

  • POST /translate (stable locale list)
  • POST /experimental_translation (stable + experimental locale list)

Both endpoints reject a request if:

  • no model is loaded
  • a model is still loading
  • the requested model does not match the loaded model
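
As a hedged sketch of a text translation request (the field names below mirror the image endpoint's form fields plus an assumed text field; confirm the exact schema in the interactive docs at /docs):

curl -X POST "http://127.0.0.1:8080/translate" \
    -H "Content-Type: application/json" \
    -d '{"model":"translategemma-4b-it-Q8_0","text":"Hello, world!","source_lang_code":"en","target_lang_code":"es","max_new_tokens":200}'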

Image Translation Endpoints

This API supports image translation via multipart upload using llama-cpp-python vision chat formatting.

  • POST /translate_image (stable locale list)

Notes:

  • Upload images with multipart/form-data as field file.
  • The image stays local to the server process and is sent to the model as a Base64 data URI (see the sketch after the example below).
  • The loaded model must be vision-enabled.
  • Vision is enabled only when /model/load is called with an mmproj value.
  • Image translation requests are rejected if the currently loaded model was not loaded with mmproj.

Example (stable image route):

curl -X POST "http://127.0.0.1:8080/translate_image" \
    -F "file=@C:/path/to/image.jpg" \
    -F "model=translategemma-4b-it-Q8_0" \
    -F "source_lang_code=en" \
    -F "target_lang_code=es" \
    -F "max_new_tokens=200"

Project showcase

MetalGlot

MetalGlot is a private, local-first AI translation desktop app for developers and creators who want a secure translation workflow without relying on cloud services. Built primarily for software localization and structured i18n content, MetalGlot helps users translate text, locale files, markdown, subtitles, and image-based content on their own hardware while avoiding telemetry, protecting intellectual property, and eliminating recurring per-token costs.


License

This project is licensed under the MIT License. See the LICENSE file for details.