How to use "Gemma-4-E4B-it-heretic-GGUF" ? #2175
Description
Hi, I've tried every model I could, from Qwen3.5-9B-Claude-4.6-Opus-abliterated to c4ai-command-r7b, but since Gemma-4-E4B just released I found it interesting and want to do some research with it. I couldn't get it to work the way I want, though: when I ask it to translate something or to help with a project, it just spams output, analyzes feelings, or explains nonsense. Via Koboldcpp it works smoothly and answers me correctly, but I want to use it from Python only, not Koboldcpp. Any assistance would be appreciated! Below is my code:
```python
import os
import gc
import asyncio
from concurrent.futures import ThreadPoolExecutor

from llama_cpp import Llama

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))  # was `file`
MODEL_PATH = os.path.join(BASE_DIR, "models", "gemma-4-E4B-it-heretic-Q4_K_M.gguf")

conversation_history = []
executor = ThreadPoolExecutor()  # was never defined but is used below

print("Loading!")
try:
    llm = Llama(
        model_path=MODEL_PATH,
        n_gpu_layers=-1,
        n_ctx=6144,
        verbose=True,
        use_mlock=True,
        use_mmap=True,
        n_batch=2048,
        n_ubatch=1024,
        offload_kqv=True,
        last_n_tokens_size=128,
        # Must match a chat format registered in llama-cpp-python, otherwise
        # the chat template applied may not be the one the model was tuned for.
        chat_format="gemma-4",
        n_threads=os.cpu_count() or 8,
        n_threads_batch=os.cpu_count() or 8,
    )
    print("Loaded!")
except Exception as e:
    print(f"Failed! {e}")
    llm = None

def local_ai(text):
    global conversation_history
    if not llm:
        return "Model could not be loaded"
    try:
        messages = [
            {
                "role": "system",
                "content": "You're a brilliant AI assistant! Provide what is needed!",
            }
        ]
        # Keep only the last two exchanges (4 messages) of history.
        if len(conversation_history) > 4:
            conversation_history = conversation_history[-4:]
        messages.extend(conversation_history)
        messages.append({"role": "user", "content": text})  # was `{text}"`, a syntax error
        response = llm.create_chat_completion(
            messages=messages,
            max_tokens=1024,
            temperature=0.1,
            top_p=0.9,
            top_k=40,
            repeat_penalty=1.1,
            stop=["<|turn|>"],  # stop sequences go here, not in the Llama() constructor
        )
        output = response["choices"][0]["message"]["content"].strip()
        conversation_history.append({"role": "user", "content": text})
        conversation_history.append({"role": "assistant", "content": output})
        gc.collect()
        return output
    except Exception as e:
        print(f"CRITICAL: {e}")
        return "GONE WRONG!"  # return a plain string on every path, not a tuple

# `async` is a reserved keyword in Python, so the coroutine needs another name.
async def async_local_ai(text):
    loop = asyncio.get_running_loop()
    try:
        # local_ai takes a single argument; the stray `lang` is gone.
        return await loop.run_in_executor(executor, local_ai, text)
    except Exception as e:
        print(f"Async Error: {e}")
        return None
```