test: add //rs/tests/node:xfs_corruption_repair_test #9585

Draft
basvandijk wants to merge 17 commits into master from basvandijk/xfs_corruption_repair_test

Collaborator

@basvandijk basvandijk commented Mar 25, 2026

Based on: #9536.

Adds the //rs/tests/node:xfs_corruption_repair_test which:

  • Boots an IC node.
  • Waits for it to become healthy.
  • Then stops all services that write to /var/lib/ic/data.
  • Unmounts /var/lib/ic/data.
  • Corrupts the XFS log of /dev/mapper/store-shared--data.
  • Kills the node.
  • Starts it up again.
  • Asserts that xfs_repair -L has run and that the files in /var/lib/ic/data are still there.
  • Waits for the node to become healthy.

That last step fails for some reason I haven't figured out yet.
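The file-preservation assertion in the step list above could be sketched roughly as follows. This is only an illustration in Python, not the test's actual code: it assumes the test records a listing of /var/lib/ic/data before the corruption step and another after the reboot, and the helper name `missing_after_repair` and the sample file names are hypothetical.

```python
from typing import Iterable, List


def missing_after_repair(before: Iterable[str], after: Iterable[str]) -> List[str]:
    """Return the paths present before the corruption but absent after the repair.

    Hypothetical helper: both arguments are plain path listings, however the
    test happens to collect them (for example over SSH).
    """
    return sorted(set(before) - set(after))


# Hypothetical listings, for illustration only.
before = ["ic_state/checkpoint_a", "ic_state/checkpoint_b"]
after = ["ic_state/checkpoint_a", "ic_state/checkpoint_b", "lost+found"]

# New entries after the repair are tolerated; only disappearances are reported.
assert missing_after_repair(before, after) == []
```

Comparing sets rather than ordered lists keeps the check independent of directory iteration order, and reporting the missing paths makes a failure easier to diagnose than a bare boolean.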

basvandijk and others added 16 commits March 23, 2026 08:11
The test intermittently fails because vm().kill() (virsh destroy) can leave
the XFS filesystem on /var/lib/ic/data in a corrupted state. When the VM is
restarted with vm().start(), the GuestOS cannot mount the corrupted data
partition and enters emergency mode, making the node permanently unreachable.

The root cause is in setup-shared-data.sh which only checks whether a
filesystem exists (via blkid) but does not verify its integrity. A corrupted
XFS filesystem passes the blkid check but fails to mount.

Fix: After confirming a filesystem exists, attempt a test mount. If the mount
fails (corrupted journal/metadata), run xfs_repair. If repair also fails,
reformat the partition. The IC node recovers its state via state sync, so no
data is permanently lost.
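The escalation order described in this fix (test mount, then xfs_repair, then reformat) can be sketched as below. This is a Python illustration of the control flow only, not the contents of setup-shared-data.sh: the function name and the three callables are hypothetical stand-ins for the real commands, and the sketch covers only the case where a filesystem is already known to exist.

```python
from typing import Callable


def recover_existing_filesystem(
    test_mount: Callable[[], bool],
    repair: Callable[[], bool],
    reformat: Callable[[], None],
) -> str:
    """Return which stage brought the data partition back.

    Each callable is a hypothetical wrapper around the real command and
    reports success as True (reformat is assumed to always succeed).
    """
    if test_mount():
        return "mounted"      # healthy, or journal replayed by the mount
    if repair():
        return "repaired"     # xfs_repair fixed the filesystem
    reformat()                # last resort; state comes back via state sync
    return "reformatted"


# A corrupted filesystem that xfs_repair manages to fix:
outcome = recover_existing_filesystem(lambda: False, lambda: True, lambda: None)
assert outcome == "repaired"
```

Each stage runs only when the previous one has failed, so a healthy boot never reaches the repair or reformat steps.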

On healthy boots (including those with a dirty but valid XFS journal), the
test mount succeeds instantly and replays the journal, causing no additional
delay.

Replace the test-mount approach with xfs_repair-based detection.
A test mount fails on a dirty-but-valid journal (normal after crash),
triggering unnecessary xfs_repair -L and journal data loss.

Now xfs_repair (without -L) is used as the check: it exits non-zero
with a specific message when the journal just needs replay, which the
real mount via fstab handles automatically. Only on actual corruption
do we escalate to xfs_repair -L, then reformat as last resort.
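The decision logic of this second approach might look roughly like the sketch below. It is written in Python purely for illustration and is not the script's code. The names `Action`, `after_check`, and `after_zero_log_repair` are hypothetical, and the `REPLAY_MARKER` substring is an assumed placeholder for the "specific message" mentioned above; the text the script actually matches is not shown in this PR.

```python
from enum import Enum


class Action(Enum):
    MOUNT = "mount"                    # let the regular fstab mount proceed
    REPAIR_ZERO_LOG = "xfs_repair -L"  # discard the journal and repair
    REFORMAT = "reformat"              # last resort


# Assumption: placeholder for the message xfs_repair prints when the journal
# only needs replay. Check the real output before relying on this string.
REPLAY_MARKER = "needs to be replayed"


def after_check(exit_code: int, output: str, marker: str = REPLAY_MARKER) -> Action:
    """Decide what to do after running xfs_repair without -L."""
    if exit_code == 0:
        return Action.MOUNT            # filesystem is clean
    if marker in output:
        return Action.MOUNT            # dirty but valid journal: mount replays it
    return Action.REPAIR_ZERO_LOG      # actual corruption


def after_zero_log_repair(exit_code: int) -> Action:
    """Decide what to do after running xfs_repair -L."""
    return Action.MOUNT if exit_code == 0 else Action.REFORMAT


# A dirty-but-valid journal no longer triggers xfs_repair -L:
assert after_check(2, "ERROR: log " + REPLAY_MARKER) is Action.MOUNT
```

Keeping the marker as a parameter makes the one fragile part of the check, matching on tool output, explicit and easy to adjust.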
…es/setup-shared-data.sh

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@github-actions github-actions bot added the test label Mar 25, 2026
Base automatically changed from ai/deflake-rejoin_test_large_state-2026-03-22 to master March 30, 2026 10:30