
How to use "Gemma-4-E4B-it-heretic-GGUF"? #2175

@TimmyHeart

Description

Hi, I've tried every model I could, from Qwen3.5-9B-Claude-4.6-Opus-abliterated to c4ai-command-r7b, but since Gemma-4-E4B was just released I found it interesting and want to do some research with it. However, I couldn't get it to work the way I want. When I ask it to translate things or to help with any project, it just spams text, analyzes feelings, or explains nonsense. Via Koboldcpp it works smoothly and answers me correctly, but I want to use it from Python only, not Koboldcpp. Any assistance would be appreciated! Below is my code:

import os
import gc
import asyncio
from concurrent.futures import ThreadPoolExecutor
from llama_cpp import Llama

# __file__ (with double underscores), not file
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
MODEL_PATH = os.path.join(BASE_DIR, "models", "gemma-4-E4B-it-heretic-Q4_K_M.gguf")

conversation_history = []
executor = ThreadPoolExecutor(max_workers=1)  # executor must be defined before use

print("Loading!")
try:
    llm = Llama(
        model_path=MODEL_PATH,
        n_gpu_layers=-1,
        n_ctx=6144,
        verbose=True,
        use_mlock=True,
        use_mmap=True,
        n_batch=2048,
        n_ubatch=1024,
        offload_kqv=True,
        last_n_tokens_size=128,
        mul_mat_q=True,
        chat_format="gemma-4",
        cuda_graphs=True,
        n_threads=os.cpu_count() or 8,
        n_threads_batch=os.cpu_count() or 8,
    )
    print("Loaded!")
except Exception as e:
    print(f"Failed! {e}")
    llm = None

def local_ai(text):
    global conversation_history
    if not llm:
        return "Model could not be loaded"
    try:
        messages = [
            {
                "role": "system",
                "content": "You're a brilliant assistant AI! Provide what is needed!"
            }
        ]
        # Keep only the last 4 history entries so the prompt stays inside n_ctx
        if len(conversation_history) > 4:
            conversation_history = conversation_history[-4:]
        messages.extend(conversation_history)
        messages.append({"role": "user", "content": text})
        response = llm.create_chat_completion(
            messages=messages,
            max_tokens=1024,
            temperature=0.1,
            top_p=0.9,
            top_k=40,
            repeat_penalty=1.1,
            stop=["<|turn|>"],  # stop is a completion argument, not a constructor one
        )
        output = response['choices'][0]['message']['content'].strip()
        conversation_history.append({"role": "user", "content": text})
        conversation_history.append({"role": "assistant", "content": output})
        gc.collect()
        return output
    except Exception as e:
        print(f"CRITICAL: {e}")
        return "GONE WRONG!"

# "async" is a reserved keyword and cannot be used as a function name
async def local_ai_async(text):
    loop = asyncio.get_event_loop()
    try:
        # local_ai only takes text, so only text is passed here
        return await loop.run_in_executor(executor, local_ai, text)
    except Exception as e:
        print(f"Async Error: {e}")
        return None
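The `run_in_executor` pattern at the end of the code can be exercised without a model: `blocking_inference` below is a hypothetical stand-in for the blocking `local_ai` call, just to show how the positional arguments are passed to the executor.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the blocking llm call, so the pattern runs anywhere.
def blocking_inference(text: str) -> str:
    return f"echo: {text}"

executor = ThreadPoolExecutor(max_workers=1)

async def run_inference(text: str) -> str:
    loop = asyncio.get_running_loop()
    # run_in_executor takes the callable, then its positional arguments
    return await loop.run_in_executor(executor, blocking_inference, text)

print(asyncio.run(run_inference("hello")))  # echo: hello
```

The event loop stays responsive while the thread pool runs the blocking call; swapping `blocking_inference` for the real `local_ai` gives the same shape as the issue's wrapper.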
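The history-trimming logic in `local_ai` can also be sketched as a pure helper, which makes it easy to check that the prompt never grows past the system message plus four history entries plus the new user turn (`build_messages` and `max_history` are names invented for this sketch, not part of llama-cpp-python):

```python
# Hypothetical helper mirroring the history handling above: keep only the
# last max_history entries so the prompt stays small.
def build_messages(history, user_text, max_history=4):
    system = {"role": "system", "content": "You're a brilliant assistant AI!"}
    trimmed = history[-max_history:]
    return [system] + trimmed + [{"role": "user", "content": user_text}]

history = [
    {"role": "user", "content": "a"},
    {"role": "assistant", "content": "b"},
    {"role": "user", "content": "c"},
    {"role": "assistant", "content": "d"},
    {"role": "user", "content": "e"},
]
msgs = build_messages(history, "hello")
print(len(msgs))  # 6: system + 4 trimmed history entries + new user message
```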
