How to use "Gemma-4-E4B-it-heretic-GGUF" ? #2175
Description
Hi, I've tried every model I could, from Qwen3.5-9B-Claude-4.6-Opus-abliterated to c4ai-command-r7b, but since Gemma-4-E4B just released I found it interesting and want to do some research with it. I couldn't get it to work the way I want, though: when I ask it to translate something or to help with a project, it just spams output, analyzes feelings, or explains nonsense. Via Koboldcpp it works smoothly and answers me correctly, but I want to use it from Python only, not Koboldcpp. Any assistance would be appreciated! Below is my code:
```python
import os
import gc
import asyncio
from concurrent.futures import ThreadPoolExecutor

from llama_cpp import Llama

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))  # was `file`
MODEL_PATH = os.path.join(BASE_DIR, "models", "gemma-4-E4B-it-heretic-Q4_K_M.gguf")

conversation_history = []
executor = ThreadPoolExecutor()  # was never defined but is used below

print("Loading!")
try:
    llm = Llama(
        model_path=MODEL_PATH,
        n_gpu_layers=-1,
        n_ctx=6144,
        verbose=True,
        use_mlock=True,
        use_mmap=True,
        n_batch=2048,
        n_ubatch=1024,
        offload_kqv=True,
        last_n_tokens_size=128,
        # Must match a chat format registered in llama-cpp-python, otherwise
        # the chat template applied may not be the one the model was tuned for.
        chat_format="gemma-4",
        n_threads=os.cpu_count() or 8,
        n_threads_batch=os.cpu_count() or 8,
    )
    print("Loaded!")
except Exception as e:
    print(f"Failed! {e}")
    llm = None

def local_ai(text):
    global conversation_history
    if not llm:
        return "Model could not be loaded"
    try:
        messages = [
            {
                "role": "system",
                "content": "You're a brilliant AI assistant! Provide what is needed!",
            }
        ]
        # Keep only the last two exchanges (4 messages) of history.
        if len(conversation_history) > 4:
            conversation_history = conversation_history[-4:]
        messages.extend(conversation_history)
        messages.append({"role": "user", "content": text})  # was `{text}"`, a syntax error
        response = llm.create_chat_completion(
            messages=messages,
            max_tokens=1024,
            temperature=0.1,
            top_p=0.9,
            top_k=40,
            repeat_penalty=1.1,
            stop=["<|turn|>"],  # stop sequences go here, not in the Llama() constructor
        )
        output = response["choices"][0]["message"]["content"].strip()
        conversation_history.append({"role": "user", "content": text})
        conversation_history.append({"role": "assistant", "content": output})
        gc.collect()
        return output
    except Exception as e:
        print(f"CRITICAL: {e}")
        return "GONE WRONG!"  # return a plain string on every path, not a tuple

# `async` is a reserved keyword in Python, so the coroutine needs another name.
async def async_local_ai(text):
    loop = asyncio.get_running_loop()
    try:
        # local_ai takes a single argument; the stray `lang` is gone.
        return await loop.run_in_executor(executor, local_ai, text)
    except Exception as e:
        print(f"Async Error: {e}")
        return None
```