fastapi-gemma-translate

This project provides a robust REST API built with FastAPI and Docker for managing and interacting with Google's Gemma Translate models, enabling on-device string translation.

Key Features

  • Text and image translation services
  • Support for multiple Gemma Translate model sizes (4B, 12B, 27B) and GGUF quantizations
  • Automatic API Docs: Interactive API documentation powered by Swagger UI and ReDoc.

Technology Stack

  • FastAPI for the core web framework.
  • Uvicorn as the ASGI server.
  • Docker for containerization and easy deployment.
  • Pydantic for data validation and settings management.

Getting Started

Prerequisites

1. Set Up the Python Environment

Create and activate a Conda environment:

conda create -n translate python=3.11
conda activate translate

Install the Python dependencies, including huggingface_hub, which provides the hf CLI used to download the models:

pip install "fastapi[standard]" "uvicorn[standard]" httpx llama-cpp-python huggingface_hub

We're going to fetch GGUF model files from these repositories:

https://huggingface.co/mradermacher/translategemma-27b-it-GGUF
https://huggingface.co/mradermacher/translategemma-12b-it-GGUF
https://huggingface.co/mradermacher/translategemma-4b-it-GGUF

Download one of the following Gemma Translate models:

Gemma 4B:

hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.IQ4_XS.gguf --local-dir app/models/translategemma-4b-it.IQ4_XS
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q2_K.gguf --local-dir app/models/translategemma-4b-it.Q2_K
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q3_K_L.gguf --local-dir app/models/translategemma-4b-it.Q3_K_L
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q3_K_M.gguf --local-dir app/models/translategemma-4b-it.Q3_K_M
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q3_K_S.gguf --local-dir app/models/translategemma-4b-it.Q3_K_S
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q4_K_M.gguf --local-dir app/models/translategemma-4b-it.Q4_K_M
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q4_K_S.gguf --local-dir app/models/translategemma-4b-it.Q4_K_S
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q5_K_M.gguf --local-dir app/models/translategemma-4b-it.Q5_K_M
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q5_K_S.gguf --local-dir app/models/translategemma-4b-it.Q5_K_S
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q6_K.gguf --local-dir app/models/translategemma-4b-it.Q6_K
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q8_0.gguf --local-dir app/models/translategemma-4b-it.Q8_0

Gemma 12B:

hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.IQ4_XS.gguf --local-dir app/models/translategemma-12b-it.IQ4_XS
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q2_K.gguf --local-dir app/models/translategemma-12b-it.Q2_K
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q3_K_L.gguf --local-dir app/models/translategemma-12b-it.Q3_K_L
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q3_K_M.gguf --local-dir app/models/translategemma-12b-it.Q3_K_M
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q3_K_S.gguf --local-dir app/models/translategemma-12b-it.Q3_K_S
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q4_K_M.gguf --local-dir app/models/translategemma-12b-it.Q4_K_M
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q4_K_S.gguf --local-dir app/models/translategemma-12b-it.Q4_K_S
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q5_K_M.gguf --local-dir app/models/translategemma-12b-it.Q5_K_M
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q5_K_S.gguf --local-dir app/models/translategemma-12b-it.Q5_K_S
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q6_K.gguf --local-dir app/models/translategemma-12b-it.Q6_K
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q8_0.gguf --local-dir app/models/translategemma-12b-it.Q8_0

Gemma 27B:

hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.IQ4_XS.gguf --local-dir app/models/translategemma-27b-it.IQ4_XS
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q2_K.gguf --local-dir app/models/translategemma-27b-it.Q2_K
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q3_K_L.gguf --local-dir app/models/translategemma-27b-it.Q3_K_L
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q3_K_M.gguf --local-dir app/models/translategemma-27b-it.Q3_K_M
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q3_K_S.gguf --local-dir app/models/translategemma-27b-it.Q3_K_S
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q4_K_M.gguf --local-dir app/models/translategemma-27b-it.Q4_K_M
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q4_K_S.gguf --local-dir app/models/translategemma-27b-it.Q4_K_S
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q5_K_M.gguf --local-dir app/models/translategemma-27b-it.Q5_K_M
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q5_K_S.gguf --local-dir app/models/translategemma-27b-it.Q5_K_S
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q6_K.gguf --local-dir app/models/translategemma-27b-it.Q6_K
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q8_0.gguf --local-dir app/models/translategemma-27b-it.Q8_0
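
If you plan to use image translation, you'll also need the vision projector (mmproj) file referenced in the Model Lifecycle section below. Assuming the same repository hosts it, the download follows the same pattern (shown here for the 4B Q8_0 folder):

hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.mmproj-f16.gguf --local-dir app/models/translategemma-4b-it.Q8_0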

Running the Application

Using Docker (Recommended)

This is the easiest and recommended way to run the application.

  1. Build the Docker image (either tag works; the second simply adds the Docker Hub namespace):

    docker build -t fastapi_gemma_translate .
    docker build -t grctest/fastapi_gemma_translate .
  2. Run the Docker container: This command runs the container in detached mode (-d) and maps port 8080 on your host to port 8080 in the container.

    docker run -d --name ai_container_cpu -p 127.0.0.1:8080:8080 -v C:/Users/username/Desktop/git/fastapi-gemma-translate/_models:/code/models fastapi_gemma_translate

Alternatively, you can pull and run my prebuilt Docker image:

  1. Pull the Docker image:

    docker image pull grctest/fastapi_gemma_translate
  2. Run the Docker container: This command runs the container in detached mode (-d) and maps port 8080 on your host to port 8080 in the container.

    docker run -d --name ai_container_cpu -p 127.0.0.1:8080:8080 -v C:/Users/username/Desktop/git/fastapi-gemma-translate/_models:/code/models grctest/fastapi_gemma_translate

The commands above run in CPU-only mode. If you want Nvidia CUDA GPU support, you'll need one of the CUDA Dockerfiles, either by:

  1. Building the Docker image:

    docker build -t fastapi_gemma_translate_cuda:legacy -f LegacyCudaDockerfile .
    docker build -t fastapi_gemma_translate_cuda:mainstream -f MainstreamCudaDockerfile .
    docker build -t fastapi_gemma_translate_cuda:future -f FutureCudaDockerfile .
    docker build -t grctest/fastapi_gemma_translate_cuda:legacy -f LegacyCudaDockerfile .
    docker build -t grctest/fastapi_gemma_translate_cuda:mainstream -f MainstreamCudaDockerfile .
    docker build -t grctest/fastapi_gemma_translate_cuda:future -f FutureCudaDockerfile .
  2. Pulling the Docker image:

    docker image pull grctest/fastapi_gemma_translate_cuda:legacy

    In either case, run the container with the --gpus flag:

    docker run --gpus all -d --name ai_container_gpu -p 127.0.0.1:8080:8080 -v C:/Users/username/Desktop/git/fastapi-gemma-translate/_models:/code/models -e LLAMA_N_GPU_LAYERS=-1 fastapi_gemma_translate_cuda:legacy
    docker run --gpus all -d --name ai_container_cuda -p 127.0.0.1:8080:8080 -v C:/Users/username/Desktop/git/fastapi-gemma-translate/_models:/code/models -e LLAMA_N_GPU_LAYERS=-1 grctest/fastapi_gemma_translate_cuda:legacy

Note: Replace C:/Users/username/Desktop/git/fastapi-gemma-translate/_models with the path where you downloaded the GGUF folders and files.

Container GPU variants:

  • Legacy: Pascal / 10xx Nvidia cards

  • Mainstream: Turing to Ada / 20xx, 30xx, 40xx, A100 Nvidia cards

  • Future: Blackwell / 50xx Nvidia cards

Local Development

For development, you can run the application directly with Uvicorn, which enables auto-reloading.

uvicorn app.main:app --host 0.0.0.0 --port 8080 --reload

Concurrency Tuning (Recommended for CUDA stability)

llama-cpp-python GGUF inference can become unstable under heavy parallel requests on some CUDA setups. This service therefore includes a configurable inference gate shared by /translate, /experimental_translation, and /translate_image; a minimal sketch of the gating logic appears at the end of this section.

# safest default: serialize inference
# (Windows cmd shown; use `export VAR=value` on Linux/macOS)
set LLAMA_MAX_CONCURRENT_INFERENCES=1

# max wait in seconds before returning HTTP 503
set LLAMA_INFERENCE_ACQUIRE_TIMEOUT_SECONDS=45

Notes:

  • Keep LLAMA_MAX_CONCURRENT_INFERENCES=1 for 12B/27B models unless higher values are validated on your hardware.
  • If the queue wait exceeds the timeout, the API returns 503 instead of letting requests pile up indefinitely.
  • These can be passed alongside existing runtime env vars like LLAMA_N_GPU_LAYERS.

Example:

docker run --gpus all -d --name ai_container_cuda -p 127.0.0.1:8080:8080 \
    -v C:/Users/username/Desktop/git/fastapi-gemma-translate/_models:/code/models \
    -e LLAMA_N_GPU_LAYERS=-1 \
    -e LLAMA_MAX_CONCURRENT_INFERENCES=1 \
    -e LLAMA_INFERENCE_ACQUIRE_TIMEOUT_SECONDS=45 \
    grctest/fastapi_gemma_translate_cuda:legacy
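
Internally, the gate behaves like a bounded-wait semaphore around each model call. Here is a minimal sketch in Python, assuming FastAPI and asyncio; the names are illustrative, not the service's actual internals:

import asyncio
from fastapi import HTTPException

MAX_CONCURRENT = 1        # mirrors LLAMA_MAX_CONCURRENT_INFERENCES
ACQUIRE_TIMEOUT = 45.0    # mirrors LLAMA_INFERENCE_ACQUIRE_TIMEOUT_SECONDS

_gate = asyncio.Semaphore(MAX_CONCURRENT)

async def run_inference(blocking_call, *args, **kwargs):
    try:
        # Wait for a free slot; give up once the configured timeout elapses.
        await asyncio.wait_for(_gate.acquire(), timeout=ACQUIRE_TIMEOUT)
    except asyncio.TimeoutError:
        raise HTTPException(status_code=503, detail="Inference queue timed out")
    try:
        # llama-cpp-python calls block, so run them off the event loop.
        return await asyncio.to_thread(blocking_call, *args, **kwargs)
    finally:
        _gate.release()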

API Usage

Once the server is running, you can access the interactive API documentation at http://127.0.0.1:8080/docs (Swagger UI) and http://127.0.0.1:8080/redoc (ReDoc).

Model Lifecycle (Required)

Model loading is explicit. Translation endpoints will reject requests unless the requested model is already loaded.

  1. Load a model:
curl -X POST "http://127.0.0.1:8080/model/load" \
    -H "Content-Type: application/json" \
    -d '{"model":"translategemma-4b-it-Q8_0"}'

Load a model with vision support (mmproj can be relative to the model folder or absolute):

curl -X POST "http://127.0.0.1:8080/model/load" \
    -H "Content-Type: application/json" \
    -d '{"model":"translategemma-4b-it-Q8_0","mmproj":"translategemma-4b-it.mmproj-f16.gguf"}'
  2. Check model status:
curl "http://127.0.0.1:8080/model/status"
curl "http://127.0.0.1:8080/model/status?model=translategemma-4b-it-Q8_0"

/model/status includes:

  • loaded: whether any model is loaded
  • loading: whether a load is currently in progress
  • loaded_model: currently loaded model name
  • vision_enabled: whether the currently loaded model can process images
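
An illustrative response (field values are examples, not verbatim server output):

{
  "loaded": true,
  "loading": false,
  "loaded_model": "translategemma-4b-it-Q8_0",
  "vision_enabled": true
}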

Text Translation Endpoints

  • POST /translate (stable locale list)
  • POST /experimental_translation (stable + experimental locale list)

Both endpoints reject a request if:

  • no model is loaded
  • a model is still loading
  • the requested model does not match the loaded model
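
As a hedged sketch of a text translation request (the field names below mirror the image endpoint's form fields plus an assumed text field; confirm the exact schema in the interactive docs at /docs):

curl -X POST "http://127.0.0.1:8080/translate" \
    -H "Content-Type: application/json" \
    -d '{"model":"translategemma-4b-it-Q8_0","text":"Hello, world!","source_lang_code":"en","target_lang_code":"es","max_new_tokens":200}'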

Image Translation Endpoints

This API supports image translation via multipart upload using llama-cpp-python vision chat formatting.

  • POST /translate_image (stable locale list)

Notes:

  • Upload images with multipart/form-data as field file.
  • The image stays local to the server process and is sent to the model as a Base64 data URI (see the sketch after the example below).
  • The loaded model must be vision-enabled.
  • Vision is enabled only when /model/load is called with an mmproj value.
  • Image translation requests are rejected if the currently loaded model was not loaded with mmproj.

Example (stable image route):

curl -X POST "http://127.0.0.1:8080/translate_image" \
    -F "file=@C:/path/to/image.jpg" \
    -F "model=translategemma-4b-it-Q8_0" \
    -F "source_lang_code=en" \
    -F "target_lang_code=es" \
    -F "max_new_tokens=200"

Project showcase

MetalGlot

MetalGlot is a private, local-first AI translation desktop app for developers and creators who want a secure translation workflow without relying on cloud services. Built primarily for software localization and structured i18n content, MetalGlot helps users translate text, locale files, markdown, subtitles, and image-based content on their own hardware while avoiding telemetry, protecting intellectual property, and eliminating recurring per-token costs.


License

This project is licensed under the MIT License. See the LICENSE file for details.