Skip to content

fix: AE-2322: network volume size bounds, forbid extra args#290

Merged
jhcipar merged 8 commits intomainfrom
jhcipar/AE-2322/network-volume-validation
Mar 26, 2026
Merged

fix: AE-2322: network volume size bounds, forbid extra args#290
jhcipar merged 8 commits intomainfrom
jhcipar/AE-2322/network-volume-validation

Conversation

@jhcipar
Copy link
Contributor

@jhcipar jhcipar commented Mar 25, 2026

validation for network volume config:

  • name cannot be an empty string or whitespace-only string
  • forbid extra args (should probably be on base resource config, but i think it's fine to just have it on network volume for now)
  • limit size so failures happen at config time
  • reduce default network volume size

id: Optional[str] = Field(default=None)
name: Optional[str] = None
size: Optional[int] = Field(default=100, gt=0) # Size in GB
size: Optional[int] = Field(default=100, gt=0, le=4096) # Size in GB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default should probably be lower, like 20GB? 10GB? Also I believe the minimum is 10GB.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it's 10gb, updated. 20gb seems sensible for a default

@promptless
Copy link

promptless bot commented Mar 25, 2026

📝 Documentation updates detected!

New suggestion: Document valid size range for network volumes


Tip: Configure how Promptless handles changelogs in Agent Settings 📋

Copy link
Contributor

@KAJdev KAJdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just fix the assert in the test looks good

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds stricter validation for NetworkVolume configuration so invalid volume definitions fail fast at model-construction/config time.

Changes:

  • Enforce NetworkVolume name is not empty/whitespace-only and forbid unknown fields via Pydantic config.
  • Introduce size bounds (min 10GB, max 4096GB) and change the default size to 20GB.
  • Update/unit-extend tests and manifest expectations to reflect the new validation and defaults.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/unit/test_p2_remaining_gaps.py Updates a gap-test expectation to reflect new name validation behavior and size constraint wording.
tests/unit/test_p2_gaps.py Updates VOL-* tests for new min-size constraint and new default size.
tests/unit/resources/test_network_volume.py Adds focused unit tests for empty/whitespace names, size bounds, and unknown-field rejection.
tests/unit/cli/commands/build_utils/test_manifest.py Updates manifest extraction test to assert the new default volume size.
src/runpod_flash/core/resources/network_volume.py Implements extra="forbid", adds name validator, and applies min/max size constraints plus default size change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #290 — fix: AE-2322: network volume size bounds, forbid extra args

Repo: runpod/flash
Severity: MEDIUM
Key findings: Volume creation will fail silently after this PR ships — the volume gets created in RunPod but the SDK throws an error, leaving users with orphaned paid resources.


1. Size bounds and default change

The 10 GB minimum, 4 TB ceiling, and 20 GB default are correct and well-motivated. The unit test coverage for the new bounds is thorough.

2. Bug: volume creation broken end-to-end (high confidence)

The unit tests all pass with mocked API responses, but live volume creation will fail. When RunPod actually creates a network volume, the API response includes fields the SDK doesn't declare (createdAt, podCount, isPublic, mountPath, etc.). This PR makes the SDK reject any unknown fields as invalid — so the SDK throws a validation error immediately after the volume is created.

The failure scenario: user runs flash deploy, the network volume gets created and charged in RunPod, the SDK crashes, and the deploy appears to have failed. The user retries, potentially creating multiple orphaned volumes. There is no E2E test covering live volume creation — the unit tests mock the API response and only return the four fields the SDK expects, which is why this isn't caught.

This needs a live volume creation test before shipping, or the model needs to be changed to tolerate extra fields from the API response.

3. Issue: orphaned volumes cannot be removed via the CLI

flash undeploy does not delete network volumes — the undeploy path for NetworkVolume raises NotImplementedError and tells users to delete via the RunPod UI or API. A user who hits the creation bug (section 2) and retries will accumulate orphaned paid volumes with no CLI path to clean them up. They have to find and delete them manually in the RunPod console. This makes the section 2 failure significantly worse than it appears.

4. No documentation on network volume lifecycle

There is no user-facing documentation covering: how to find and reuse an existing volume by ID, what happens to a volume across redeploys, how to delete a volume (the CLI cannot do it — users must use the RunPod UI or API, which is not mentioned anywhere), or what to do when a volume is left orphaned by a failed deploy. The SDK reference shows the constructor syntax but not the operational lifecycle. This PR adds size constraints and a new default — none of those appear in any doc either.

Given that volumes are persistent paid resources and this PR makes the failure mode worse (creation succeeds but SDK crashes), documenting the lifecycle — especially deletion and recovery — should ship alongside this fix.

Nit: users with existing endpoints will see unexpected redeployment

Any endpoint currently configured with the old 100 GB volume default will appear as "drifted" on next flash deploy and trigger a full redeploy. Not a bug introduced by this PR, but users will be surprised. Worth a note in the release.

Verdict

NEEDS WORK — the size validation is correct but the change breaks live volume creation, and orphaned volumes from failed deploys cannot be cleaned up from the CLI. Needs a live test or a code fix before merging.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

Copy link
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction to prior review

Section 2 (Bug: volume creation broken end-to-end) was wrong and should be disregarded.

I claimed the RunPod API returns extra fields (createdAt, podCount, etc.) that would cause extra="forbid" to reject the response. I verified this by calling GET /v1/networkvolumes directly — the API returns only {id, name, dataCenterId, size}, which matches the model exactly. extra="forbid" is safe. There is no breakage here.

The Nit about unexpected redeployment was also wrong — the default size stays 100 GB.

Sections 3 and 4 from the original review remain valid (CLI cannot delete volumes, lifecycle not documented). Updated verdict: PASS WITH NITS.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

@jhcipar jhcipar merged commit 6cc7ec2 into main Mar 26, 2026
4 checks passed
@jhcipar jhcipar deleted the jhcipar/AE-2322/network-volume-validation branch March 26, 2026 23:49
@promptless
Copy link

promptless bot commented Mar 26, 2026

📝 Documentation updates detected!

Updated existing suggestion: Document valid size range for network volumes


Tip: Assign suggestions to team members in the Promptless dashboard to claim work 👥

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants