Skip to content

Design eval#137

Open
haroonc wants to merge 7 commits intomainfrom
design-eval
Open

Design eval#137
haroonc wants to merge 7 commits intomainfrom
design-eval

Conversation

@haroonc
Copy link
Copy Markdown
Contributor

@haroonc haroonc commented Apr 14, 2026

No description provided.

kmontg and others added 7 commits April 13, 2026 14:42
Initial set of evals that will be run offline in a matrix of test repos
* FIX: Add GEMINI.md root level instructions

We can now have a root level GEMINI.md and per-task GEMINI.md
files if instructions are needed outside the prompt. Existing
evals have been updated to use this. We might want to add a feature
to skillgrade that suports 'default' workspace files, which would
be pretty easy to add. That would reduce some of the boilerplate.

TESTED=ran evals manually, all passed

* Updating workspace destination for per-task GEMINI.md

* Updating typo
This attempts to use a bit stronger wording that more specifically
targets instructions in the system prompt. The intent is to
help Gemini fail faster if information can't be found quickly
for proceeding with the eval case.

See https://geminicli.com/docs/cli/system-prompt/ for details on
how t o materialize the system prompt used (including injected
GEMINI.md instructions).
@haroonc haroonc assigned haroonc and kmontg and unassigned kmontg and haroonc Apr 14, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an automated evaluation framework for CI/CD tasks, including a multi-stage Dockerfile, task configurations in eval.yaml, and various validation scripts. Feedback identifies missing runtime dependencies in the Docker image, an uninitialized variable in the Cloud Build validation script, and an incorrect model name in the evaluation settings. Additionally, recommendations were made to use a stable base image for builds and to address security risks associated with piping remote scripts directly to the shell.

Comment thread Dockerfile.eval
RUN apt-get update && apt-get install -y \
curl \
procps \
python3 \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The evaluation scripts have additional runtime dependencies that are currently missing from this image: scripts/validate-cloudbuild-config.sh requires the python3-yaml package, and scripts/validate-dockerfile.sh requires the docker CLI to perform syntax checks.

    python3 \
    python3-yaml \
    docker.io \

# Check for required environment variables
MISSING_VARS=()
[[ -z "$APP_DIR" ]] && MISSING_VARS+=("APP_DIR")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The variable OUTPUT_DIR is used in multiple places (e.g., lines 104, 107, 158) but is never initialized within the script. If it remains empty, the script will attempt to write to the root directory (/cloudbuild_check_output), which will likely fail due to permission restrictions. Providing a default value ensures the script works correctly across different environments.

Suggested change
OUTPUT_DIR="${OUTPUT_DIR:-.}"

Comment thread Dockerfile.eval
@@ -0,0 +1,67 @@
# Stage 1: Build the MCP Server
FROM debian:sid AS builder
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using debian:sid (unstable) can lead to non-deterministic builds as package versions change frequently. Since the build.sh script requires Go 1.24+ (which is not available in stable Debian repositories), it is recommended to use the official Golang image based on a stable Debian release for better reliability.

FROM golang:1.24-bookworm AS builder

Comment thread Dockerfile.eval
&& rm -rf /var/lib/apt/lists/*

# gcloud
RUN curl -sSL https://sdk.cloud.google.com | bash
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

Piping curl directly to bash is a security risk as it executes unverified code from the internet. It is safer to download the script, verify its integrity, and then execute it, or use the official Google Cloud SDK package repository for Debian.

Comment thread eval.yaml
trials: 1 # TODO increase trials
timeout: 300
threshold: 0.8 # For --ci mode
grader_model: gemini-3-flash-preview # default LLM grader model
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The model gemini-3-flash-preview is not a recognized public Gemini model name. You likely intended to use a currently available model such as gemini-1.5-flash or gemini-1.5-pro.

  grader_model: gemini-1.5-flash # default LLM grader model

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants