test: add //rs/tests/node:xfs_corruption_repair_test by basvandijk · Pull Request #9585 · dfinity/ic

basvandijk · 2026-03-25T14:21:06Z

Based on: #9536.

Adds the //rs/tests/node:xfs_corruption_repair_test which:

Boots an IC node.
Waits for it to become healthy.
Then stops all services that write to /var/lib/ic/data.
Unmounts /var/lib/ic/data.
Corrupts the XFS log of /dev/mapper/store-shared--data.
Kills the node.
Starts it up again.
Asserts that xfs_repair -L has run and that the files in /var/lib/ic/data are still there.
Waits for the node to become healthy.

That last step fails for some reason I haven't figured out yet.
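For the "files are still there" assertion, one possible approach is to record a file manifest before unmounting and compare it against a manifest taken after the reboot. The snippet below is only a shell sketch of that idea; the actual test lives under //rs/tests/node and may check this differently. The function names `list_files` and `missing_files` are hypothetical.

```shell
# Hypothetical helpers, not taken from the PR.

# list_files DIR: print the relative paths of all regular files under DIR, sorted.
list_files() {
    (cd "$1" && find . -type f | LC_ALL=C sort)
}

# missing_files BEFORE AFTER: print paths listed in BEFORE but absent from AFTER.
# Both inputs must be sorted in the C locale, which list_files guarantees.
missing_files() {
    LC_ALL=C comm -23 "$1" "$2"
}

# Intended usage on the node (illustrative only):
#   list_files /var/lib/ic/data > /tmp/before.txt   # before unmounting
#   ... corrupt the log, kill and restart the node ...
#   list_files /var/lib/ic/data > /tmp/after.txt
#   [ -z "$(missing_files /tmp/before.txt /tmp/after.txt)" ]
```

Comparing only in one direction tolerates new files appearing after the restart, which seems reasonable once services start writing again.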

The test intermittently fails because vm().kill() (virsh destroy) can leave the XFS filesystem on /var/lib/ic/data in a corrupted state. When the VM is restarted with vm().start(), the GuestOS cannot mount the corrupted data partition and enters emergency mode, making the node permanently unreachable.

The root cause is in setup-shared-data.sh, which only checks whether a filesystem exists (via blkid) but does not verify its integrity. A corrupted XFS filesystem passes the blkid check but fails to mount.

Fix: After confirming a filesystem exists, attempt a test mount. If the mount fails (corrupted journal/metadata), run xfs_repair. If repair also fails, reformat the partition. The IC node recovers its state via state sync, so no data is permanently lost. On healthy boots (including those with a dirty but valid XFS journal), the test mount succeeds instantly and replays the journal, causing no additional delay.
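The escalation described in this commit message (test mount, then xfs_repair, then reformat) could be sketched roughly as follows. This is not the code in setup-shared-data.sh; all function names are hypothetical, and the helpers are split out only so that the decision logic is easy to follow and to stub.

```shell
# Hypothetical sketch; none of these names come from setup-shared-data.sh.

# try_mount DEV: mount DEV on a scratch directory, then unmount it again.
try_mount() {
    mnt="$(mktemp -d)" || return 1
    if mount -t xfs "$1" "$mnt" >&2; then
        umount "$mnt" >&2 || true          # a failed umount should not abort boot
        rmdir "$mnt" 2>/dev/null || true
        return 0
    fi
    rmdir "$mnt"
    return 1
}

# repair_fs DEV / format_fs DEV: thin wrappers around the real tools.
repair_fs() { xfs_repair "$1" >&2; }
format_fs() { mkfs.xfs -f "$1" >&2; }

# ensure_usable_fs DEV: print which path was taken (ok, repaired, reformatted).
ensure_usable_fs() {
    if try_mount "$1"; then
        echo ok
    elif repair_fs "$1" && try_mount "$1"; then
        echo repaired
    elif format_fs "$1"; then
        echo reformatted
    else
        echo failed
        return 1
    fi
}
```

Tool output is sent to stderr so that stdout carries only the outcome, which keeps the result easy to log or branch on.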

Replace the test-mount approach with xfs_repair-based detection. A test mount fails on a dirty-but-valid journal (normal after crash), triggering unnecessary xfs_repair -L and journal data loss. Now xfs_repair (without -L) is used as the check: it exits non-zero with a specific message when the journal just needs replay, which the real mount via fstab handles automatically. Only on actual corruption do we escalate to xfs_repair -L, then reformat as last resort.
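A rough sketch of the xfs_repair-based check described here, with hypothetical function names. It branches on the exit status rather than on the message text, on the assumption (from the xfs_repair(8) man page in recent xfsprogs, and worth verifying on the GuestOS image) that status 2 means xfs_repair refused to proceed because of a dirty log.

```shell
# Hypothetical sketch; none of these names come from setup-shared-data.sh.
# Assumption: xfs_repair exits with status 2 when the log is dirty and
# needs replay. Verify this against the xfsprogs version in use.

run_repair() { xfs_repair "$@" >&2; }
format_fs() { mkfs.xfs -f "$1" >&2; }

# check_fs DEV: print the action taken for DEV.
check_fs() {
    status=0
    run_repair "$1" || status=$?
    case "$status" in
        0) echo clean ;;
        2) echo replay-on-mount ;;      # leave it to the real mount via fstab
        *)
            # Actual corruption: zero the log, then reformat as a last resort.
            if run_repair -L "$1"; then
                echo log-zeroed
            elif format_fs "$1"; then
                echo reformatted
            else
                echo failed
                return 1
            fi
            ;;
    esac
}
```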


basvandijk and others added 16 commits March 23, 2026 08:11

address review: EXIT trap, log mount errors, graduated xfs_repair

5c6e83a

remove redundant rmdir, EXIT trap handles cleanup

45de036

make umount non-fatal to avoid aborting under set -e

e5a2e26

remove -x

5409538

Update ic-os/components/upgrade/shared-resources/setup-shared-resourc…

ef30c66

…es/setup-shared-data.sh Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

improved logging

2345b9a

reduce nesting

4022378

.

43815f8

abstract formatting device

9e1b27c

don't reformat on xfs_repair -L failure

f4795b1

Revert back to doing a test mount

185619f

fix

141d349

fix

b66442e

test: add //rs/tests/node:xfs_corruption_repair_test

45c0b71

github-actions bot added the test label Mar 25, 2026

Automatically updated Cargo*.lock

31a0186

basvandijk mentioned this pull request Mar 25, 2026

fix: deflake //rs/tests/message_routing:rejoin_test_large_state #9536

Merged

Base automatically changed from ai/deflake-rejoin_test_large_state-2026-03-22 to master March 30, 2026 10:30

basvandijk wants to merge 17 commits into master from basvandijk/xfs_corruption_repair_test
