NAS backup: resume paused VM on backup failure and fix missing exit #12822
Open
jmsperu wants to merge 1 commit into apache:main from
When a NAS backup job fails (e.g. due to backup storage being full or I/O errors), the VM may remain indefinitely paused because:

1. The cleanup() function never checks or resumes the VM's paused state that was set by virsh backup-begin during the push backup operation.
2. The 'Failed' case in the backup job monitoring loop calls cleanup() but lacks an 'exit' statement, causing an infinite loop where the script repeatedly detects the failed job and calls cleanup().
3. Similarly, backup_stopped_vm() calls cleanup() on qemu-img convert failure but does not exit, allowing the loop to continue with subsequent disks despite the failure.

This fix:

- Adds VM state detection and resume to cleanup(), ensuring the VM is always resumed if found in a paused state during error handling
- Adds missing 'exit 1' after cleanup() in the Failed backup job case to prevent the infinite monitoring loop
- Adds missing 'exit 1' after cleanup() in backup_stopped_vm() on qemu-img convert failure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
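The two parts of the fix described above can be sketched in shell roughly as follows. This is a minimal illustration under stated assumptions, not the actual nasbackup.sh code: the function names `resume_if_paused`, `cleanup_sketch` and `monitor_backup_job`, the `VIRSH` and `POLL_INTERVAL` variables, and the use of `virsh domjobinfo` to poll the job status are all hypothetical choices made for the example.

```shell
# Hypothetical sketch; names and structure are assumptions, not the real script.
VIRSH="${VIRSH:-virsh}"   # overridable so the logic can be exercised without libvirt

# Resume the domain only if libvirt reports it as paused.
resume_if_paused() {
  local vm="$1" state
  # If the state cannot be read (e.g. domain is gone), there is nothing to resume.
  state="$($VIRSH -c qemu:///system domstate "$vm" 2>/dev/null)" || return 0
  if [ "$state" = "paused" ]; then
    echo "Resuming paused VM $vm"
    $VIRSH -c qemu:///system resume "$vm"
  fi
}

# Error-path cleanup: resume first, then release temporary resources.
cleanup_sketch() {
  local vm="$1"
  resume_if_paused "$vm"
  # ... removing temporary files and unmounting backup storage would go here ...
}

# Poll the backup job until it completes or fails.
monitor_backup_job() {
  local vm="$1" status
  while true; do
    status="$($VIRSH -c qemu:///system domjobinfo "$vm" --completed --keep-completed \
      | awk '/Job type:/ {print $3}')"
    case "$status" in
      Completed)
        return 0
        ;;
      Failed)
        cleanup_sketch "$vm"
        exit 1   # without this, the loop sees "Failed" again and repeats cleanup forever
        ;;
    esac
    sleep "${POLL_INTERVAL:-5}"
  done
}
```

In this sketch, the resume happens before any unmounting so that a failure later in cleanup cannot leave the guest paused, and `exit 1` is what actually terminates the script once the job is known to have failed.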
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #12822 +/- ##
============================================
- Coverage 17.95% 17.94% -0.01%
+ Complexity 16259 16258 -1
============================================
Files 5954 5954
Lines 534838 534838
Branches 65423 65423
============================================
- Hits 96010 95991 -19
- Misses 428053 428074 +21
+ Partials 10775 10773 -2
weizhouapache (Member) approved these changes on Mar 17, 2026 and left a comment:
code lgtm
not tested yet
Summary
Fixes #12821 — KVM VMs remain indefinitely paused when NAS backup job fails.
When `virsh backup-begin` executes a push backup, QEMU pauses the domain for a consistent snapshot. If the backup write fails (e.g. NFS storage full), `nasbackup.sh` calls `cleanup()` but:

1. `cleanup()` only removes files and unmounts; it never resumes the VM
2. A missing `exit` after `cleanup()` in the `Failed` case causes an infinite loop
3. In `backup_stopped_vm()`, a `qemu-img convert` failure calls `cleanup()` but continues processing

Changes

- `cleanup()`: Added VM state detection via `virsh domstate` and automatic `virsh resume` if the VM is found paused, ensuring the VM is always resumed during error handling
- `backup_running_vm()`: Added `exit 1` after `cleanup()` in the `Failed` backup job case to terminate the infinite monitoring loop
- `backup_stopped_vm()`: Added `exit 1` after `cleanup()` on `qemu-img convert` failure

Evidence

In production, NFS backup storage filling to 100% caused 8 VMs to become paused simultaneously across 3 KVM hosts. Some VMs remained paused for over 6 hours. The CloudStack UI showed them as "Running" while they were actually paused at the KVM level, requiring a manual `virsh resume` on each host.

Note

The pattern of checking and resuming paused VMs already exists in the Java layer (see `LibvirtBackupSnapshotCommandWrapper.java:186-188` and `KVMStorageProcessor.java:2268-2272`) but was missing from the shell script that actually manages the backup lifecycle.

Test plan

- `cleanup()` correctly resumes VM before removing temp files and unmounting

🤖 Generated with Claude Code
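For the manual per-host recovery mentioned under Evidence, something along these lines could list and resume paused domains on one KVM host. It is only a sketch: the function name `resume_all_paused` and the `VIRSH` variable are hypothetical, and it assumes every paused domain on the host was paused by a failed backup, which may not hold if an operator paused a VM deliberately.

```shell
# Hypothetical helper; assumes all paused domains on this host should be resumed.
VIRSH="${VIRSH:-virsh}"   # overridable so the loop can be tried without libvirt

resume_all_paused() {
  local vm
  # --state-paused filters the listing; --name prints one domain name per line.
  $VIRSH -c qemu:///system list --state-paused --name | while IFS= read -r vm; do
    [ -n "$vm" ] || continue          # skip the trailing blank line virsh prints
    echo "Resuming $vm"
    # Redirect stdin so the resume call cannot consume the remaining names.
    $VIRSH -c qemu:///system resume "$vm" </dev/null
  done
}
```

Running `virsh list --state-paused --name` on its own first, and reviewing the output, would be the more cautious way to use this.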