Skip to content

RATIS-244. Skip snapshot file if corresponding MD5 file is missing#1320

Merged
szetszwo merged 9 commits intoapache:masterfrom
spacemonkd:RATIS-244
Mar 28, 2026
Merged

RATIS-244. Skip snapshot file if corresponding MD5 file is missing#1320
szetszwo merged 9 commits intoapache:masterfrom
spacemonkd:RATIS-244

Conversation

@spacemonkd
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Snapshot is not an atomic operation and in some cases there might be corrupted snapshots.
Every snapshot has an associated MD5 file with it, if this file is missing we can skip the snapshot.

This PR adds the check for the existence of MD5 file and skips any snapshot which doesn't have it.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-244

How was this patch tested?

Patch was tested manually by running the unit tests.

Copy link
Copy Markdown
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devabhishekpal , thanks for working on this! Just have a minor comment inlined.

@szetszwo
Copy link
Copy Markdown
Contributor

@devabhishekpal , could you also add some tests to TestRaftStorage?

@spacemonkd spacemonkd marked this pull request as ready for review November 21, 2025 19:20
@szetszwo
Copy link
Copy Markdown
Contributor

@devabhishekpal , thanks for the update! The change looks good but there are test failures. Please take a look.

@spacemonkd
Copy link
Copy Markdown
Contributor Author

It seems the tests are failing because in findLatestSnapshot() we are doing:

getSingleFileSnapshotInfos(dir, true).iterator()

Now if we create an initial snapshot with md5 file, this will return null from the check:

if (!i.hasNext()) {
      return null;
}

We now create a second snapshot with no MD5 hash, this will still return null as the iterator still has one element.
Hence when we run a scenario as below:

try {
      createSnapshot(simpleStateMachineStorage, 1, 100, true);
      simpleStateMachineStorage.loadLatestSnapshot();

      createSnapshot(simpleStateMachineStorage, 1, 200, false);
      simpleStateMachineStorage.loadLatestSnapshot();

      SingleFileSnapshotInfo latest = simpleStateMachineStorage.getLatestSnapshot();
      Assertions.assertNotNull(latest);
}

latest is always returned as null even though there are two snapshots present.
Previous behaviour would have been that it would have returned one snapshot.

This is also causing gRPC test failures as SimpleStateMachineStorage.findLatestSnapshot() filters out every snapshot that lacks an MD5 file as we are passing (requireMd5 == true). So on the follower that just copied the snapshot, findLatestSnapshot now returns null - loadLatestSnapshot() never updates the cache, and the snapshot install is effectively ignored.
Hence follower never recognises the installed snapshot, it fails to catch up and doesn’t participate in the new configuration. setConfiguration therefore keeps retrying and times-out

@szetszwo do you think we might need to change the test case? Or should the behaviour be changed?

@szetszwo
Copy link
Copy Markdown
Contributor

latest is always returned as null even though there are two snapshots present.

It was a bug previously that it was using filename but not path. Is it still happening after the fix?

@spacemonkd
Copy link
Copy Markdown
Contributor Author

Yup, we are still seeing the same issue

@szetszwo
Copy link
Copy Markdown
Contributor

... we are still seeing the same issue

Could you debug it and find out why?

@szetszwo
Copy link
Copy Markdown
Contributor

createSnapshot(simpleStateMachineStorage, 1, 100, true);

@devabhishekpal , your test creating fake MD5 file is causing exception

2025-11-29 13:43:15,570 [main] WARN  impl.SimpleStateMachineStorage (SimpleStateMachineStorage.java:loadLatestSnapshot(262)) - Failed to updateLatestSnapshot from target/test/data/81dc258c/TestRaftStorage/TestRaftStorage.testCreateSnapshot/sm
java.io.IOException: Invalid MD5 file target/test/data/81dc258c/TestRaftStorage/TestRaftStorage.testCreateSnapshot/sm/snapshot.1_100.md5: the content "null" does not match the pattern ([0-9a-f]{32}) [ *](.+)
	at org.apache.ratis.util.MD5FileUtil.lambda$readStoredMd5ForFile$0(MD5FileUtil.java:99)
	at java.base/java.util.Optional.orElseThrow(Optional.java:408)
	at org.apache.ratis.util.MD5FileUtil.readStoredMd5ForFile(MD5FileUtil.java:98)
	at org.apache.ratis.statemachine.impl.SimpleStateMachineStorage.findLatestSnapshot(SimpleStateMachineStorage.java:223)
	at org.apache.ratis.statemachine.impl.SimpleStateMachineStorage.loadLatestSnapshot(SimpleStateMachineStorage.java:258)
	at org.apache.ratis.server.storage.TestRaftStorage.testCreateSnapshot(TestRaftStorage.java:101)

@spacemonkd spacemonkd requested a review from szetszwo March 7, 2026 16:20
@spacemonkd
Copy link
Copy Markdown
Contributor Author

@szetszwo could you take another look at this PR?

@spacemonkd
Copy link
Copy Markdown
Contributor Author

@szetszwo I have fixed all the test issues and modified it so that if there is only one snapshot and that has a missing MD5 it will select that snapshot.
What do you think?

Copy link
Copy Markdown
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spacemonkd , Thanks for the update and sorry for being late! Please see the comments inlined and also https://issues.apache.org/jira/secure/attachment/13081442/1320_review.patch

* The objects of this class are immutable.
*/
public class SingleFileSnapshotInfo extends FileListSnapshotInfo {
private final Boolean hasMd5; // Whether the snapshot file has a corresponding MD5 file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new field is not needed. Use FileInfo to determine:

  /** @return true iff the MD5 exists. */
  public boolean hasMd5() {
    return getFile().getFileDigest() != null;
  }

}

static List<SingleFileSnapshotInfo> getSingleFileSnapshotInfos(Path dir) throws IOException {
static List<SingleFileSnapshotInfo> getSingleFileSnapshotInfos(Path dir, boolean requireMd5) throws IOException {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requireMd5 is always false. Let's remove it.

Comment on lines +99 to +100
final FileInfo fileInfo = new FileInfo(path, null); //No FileDigest here.
infos.add(new SingleFileSnapshotInfo(fileInfo, term, index));
infos.add(new SingleFileSnapshotInfo(fileInfo, term, index, hasMd5));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's read the md5 file here:

            final MD5Hash md5 = MD5FileUtil.readStoredMd5ForFile(path.toFile());
            final FileInfo fileInfo = new FileInfo(path, md5);
            infos.add(new SingleFileSnapshotInfo(fileInfo, term, index));

Comment on lines +212 to +232
// Track the latest snapshot without MD5 as fallback.
SingleFileSnapshotInfo fallbackWithoutMd5 = null;
for (SingleFileSnapshotInfo latest : infos) {
if (!latest.hasMd5()) {
if (fallbackWithoutMd5 == null) {
fallbackWithoutMd5 = latest;
}
continue;
}

final Path path = latest.getFile().getPath();
try {
final MD5Hash md5 = MD5FileUtil.readStoredMd5ForFile(path.toFile());
if (md5 == null) {
LOG.warn("Snapshot file {} has missing MD5 file.", latest);
continue;
}
final FileInfo info = new FileInfo(path, md5);
return new SingleFileSnapshotInfo(info, latest.getTerm(), latest.getIndex(), true);
} catch (IOException e) {
LOG.warn("Failed to read MD5 for snapshot file {}, trying older snapshots.", latest, e);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified since either all snapshots do not have MD5 or at least one has.

  static SingleFileSnapshotInfo findLatestSnapshot(Path dir) throws IOException {
    final List<SingleFileSnapshotInfo> infos = getSingleFileSnapshotInfos(dir);
    if (infos.isEmpty()) {
      return null;
    }
    infos.sort(Comparator.comparing(SingleFileSnapshotInfo::getIndex).reversed());

    for (SingleFileSnapshotInfo latest : infos) {
      if (latest.hasMd5()) {
        return latest;
      }
    }

    return infos.get(0); // all snapshots do not have MD5
  }

}

final List<SingleFileSnapshotInfo> allSnapshotFiles = getSingleFileSnapshotInfos(stateMachineDir.toPath());
// Fetch all the snapshot files irrespective of whether they have an MD5 file or not
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move it as javadoc of getSingleFileSnapshotInfos(..).

@spacemonkd
Copy link
Copy Markdown
Contributor Author

@szetszwo thanks a lot for your continuous inputs. I have addressed the changes in the latest commit

@spacemonkd spacemonkd requested a review from szetszwo March 27, 2026 07:09
Copy link
Copy Markdown
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the change looks good.

@szetszwo szetszwo merged commit 96dc998 into apache:master Mar 28, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants