Design eval by haroonc · Pull Request #137 · gemini-cli-extensions/cicd

haroonc · 2026-04-14T15:11:36Z

No description provided.

Initial set of evals that will be run offline in a matrix of test repos

* FIX: Add GEMINI.md root level instructions We can now have a root level GEMINI.md and per-task GEMINI.md files if instructions are needed outside the prompt. Existing evals have been updated to use this. We might want to add a feature to skillgrade that suports 'default' workspace files, which would be pretty easy to add. That would reduce some of the boilerplate. TESTED=ran evals manually, all passed * Updating workspace destination for per-task GEMINI.md * Updating typo

This attempts to use a bit stronger wording that more specifically targets instructions in the system prompt. The intent is to help Gemini fail faster if information can't be found quickly for proceeding with the eval case. See https://geminicli.com/docs/cli/system-prompt/ for details on how t o materialize the system prompt used (including injected GEMINI.md instructions).

gemini-code-assist

Code Review

This pull request introduces an automated evaluation framework for CI/CD tasks, including a multi-stage Dockerfile, task configurations in eval.yaml, and various validation scripts. Feedback identifies missing runtime dependencies in the Docker image, an uninitialized variable in the Cloud Build validation script, and an incorrect model name in the evaluation settings. Additionally, recommendations were made to use a stable base image for builds and to address security risks associated with piping remote scripts directly to the shell.

gemini-code-assist · 2026-04-14T15:16:13Z

+RUN apt-get update && apt-get install -y \
+    curl \
+    procps \
+    python3 \


The evaluation scripts have additional runtime dependencies that are currently missing from this image: scripts/validate-cloudbuild-config.sh requires the python3-yaml package, and scripts/validate-dockerfile.sh requires the docker CLI to perform syntax checks.

python3 \ python3-yaml \ docker.io \

gemini-code-assist · 2026-04-14T15:16:13Z

+# Check for required environment variables
+MISSING_VARS=()
+[[ -z "$APP_DIR" ]] && MISSING_VARS+=("APP_DIR")
+


The variable OUTPUT_DIR is used in multiple places (e.g., lines 104, 107, 158) but is never initialized within the script. If it remains empty, the script will attempt to write to the root directory (/cloudbuild_check_output), which will likely fail due to permission restrictions. Providing a default value ensures the script works correctly across different environments.

Suggested change

OUTPUT_DIR="${OUTPUT_DIR:-.}"

gemini-code-assist · 2026-04-14T15:16:13Z

@@ -0,0 +1,67 @@
+# Stage 1: Build the MCP Server
+FROM debian:sid AS builder


Using debian:sid (unstable) can lead to non-deterministic builds as package versions change frequently. Since the build.sh script requires Go 1.24+ (which is not available in stable Debian repositories), it is recommended to use the official Golang image based on a stable Debian release for better reliability.

FROM golang:1.24-bookworm AS builder

gemini-code-assist · 2026-04-14T15:16:13Z

+    && rm -rf /var/lib/apt/lists/*
+
+# gcloud
+RUN curl -sSL https://sdk.cloud.google.com | bash


Piping curl directly to bash is a security risk as it executes unverified code from the internet. It is safer to download the script, verify its integrity, and then execute it, or use the official Google Cloud SDK package repository for Debian.

gemini-code-assist · 2026-04-14T15:16:13Z

+  trials: 1 # TODO increase trials
+  timeout: 300
+  threshold: 0.8 # For --ci mode
+  grader_model: gemini-3-flash-preview # default LLM grader model


The model gemini-3-flash-preview is not a recognized public Gemini model name. You likely intended to use a currently available model such as gemini-1.5-flash or gemini-1.5-pro.

grader_model: gemini-1.5-flash # default LLM grader model

kmontg and others added 7 commits April 13, 2026 14:42

Add initial eval set (#126)

37c387d

Initial set of evals that will be run offline in a matrix of test repos

Update evals with GEMINI.md instructions

93ac79a

Make use of agentWorkingDir to specify directory agent runs from (#128)

981ccb3

Update evals with default workspace for shared GEMINI.md (#135)

792ab27

fix: add eval for simple design with terraform

a445029

haroonc assigned haroonc and kmontg and unassigned kmontg and haroonc Apr 14, 2026

gemini-code-assist bot reviewed Apr 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Design eval#137

Design eval#137
haroonc wants to merge 7 commits intomainfrom
design-eval

haroonc commented Apr 14, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Apr 14, 2026

Uh oh!

gemini-code-assist bot Apr 14, 2026

Uh oh!

gemini-code-assist bot Apr 14, 2026

Uh oh!

gemini-code-assist bot Apr 14, 2026

Uh oh!

gemini-code-assist bot Apr 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -0,0 +1,67 @@
		# Stage 1: Build the MCP Server
		FROM debian:sid AS builder

Conversation

haroonc commented Apr 14, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants