RATIS-244. Skip snapshot file if corresponding MD5 file is missing#1320
RATIS-244. Skip snapshot file if corresponding MD5 file is missing#1320szetszwo merged 9 commits intoapache:masterfrom
Conversation
szetszwo
left a comment
There was a problem hiding this comment.
@devabhishekpal , thanks for working on this! Just have a minor comment inlined.
|
@devabhishekpal , could you also add some tests to |
|
@devabhishekpal , thanks for the update! The change looks good but there are test failures. Please take a look. |
|
It seems the tests are failing because in Now if we create an initial snapshot with md5 file, this will return null from the check: We now create a second snapshot with no MD5 hash, this will still return null as the iterator still has one element. latest is always returned as null even though there are two snapshots present. This is also causing gRPC test failures as @szetszwo do you think we might need to change the test case? Or should the behaviour be changed? |
It was a bug previously that it was using |
|
Yup, we are still seeing the same issue |
Could you debug it and find out why? |
@devabhishekpal , your test creating fake MD5 file is causing exception 2025-11-29 13:43:15,570 [main] WARN impl.SimpleStateMachineStorage (SimpleStateMachineStorage.java:loadLatestSnapshot(262)) - Failed to updateLatestSnapshot from target/test/data/81dc258c/TestRaftStorage/TestRaftStorage.testCreateSnapshot/sm
java.io.IOException: Invalid MD5 file target/test/data/81dc258c/TestRaftStorage/TestRaftStorage.testCreateSnapshot/sm/snapshot.1_100.md5: the content "null" does not match the pattern ([0-9a-f]{32}) [ *](.+)
at org.apache.ratis.util.MD5FileUtil.lambda$readStoredMd5ForFile$0(MD5FileUtil.java:99)
at java.base/java.util.Optional.orElseThrow(Optional.java:408)
at org.apache.ratis.util.MD5FileUtil.readStoredMd5ForFile(MD5FileUtil.java:98)
at org.apache.ratis.statemachine.impl.SimpleStateMachineStorage.findLatestSnapshot(SimpleStateMachineStorage.java:223)
at org.apache.ratis.statemachine.impl.SimpleStateMachineStorage.loadLatestSnapshot(SimpleStateMachineStorage.java:258)
at org.apache.ratis.server.storage.TestRaftStorage.testCreateSnapshot(TestRaftStorage.java:101) |
|
@szetszwo could you take another look at this PR? |
|
@szetszwo I have fixed all the test issues and modified it so that if there is only one snapshot and that has a missing MD5 it will select that snapshot. |
szetszwo
left a comment
There was a problem hiding this comment.
@spacemonkd , Thanks for the update and sorry for being late! Please see the comments inlined and also https://issues.apache.org/jira/secure/attachment/13081442/1320_review.patch
| * The objects of this class are immutable. | ||
| */ | ||
| public class SingleFileSnapshotInfo extends FileListSnapshotInfo { | ||
| private final Boolean hasMd5; // Whether the snapshot file has a corresponding MD5 file |
There was a problem hiding this comment.
The new field is not needed. Use FileInfo to determine:
/** @return true iff the MD5 exists. */
public boolean hasMd5() {
return getFile().getFileDigest() != null;
}| } | ||
|
|
||
| static List<SingleFileSnapshotInfo> getSingleFileSnapshotInfos(Path dir) throws IOException { | ||
| static List<SingleFileSnapshotInfo> getSingleFileSnapshotInfos(Path dir, boolean requireMd5) throws IOException { |
There was a problem hiding this comment.
requireMd5 is always false. Let's remove it.
| final FileInfo fileInfo = new FileInfo(path, null); //No FileDigest here. | ||
| infos.add(new SingleFileSnapshotInfo(fileInfo, term, index)); | ||
| infos.add(new SingleFileSnapshotInfo(fileInfo, term, index, hasMd5)); |
There was a problem hiding this comment.
Let's read the md5 file here:
final MD5Hash md5 = MD5FileUtil.readStoredMd5ForFile(path.toFile());
final FileInfo fileInfo = new FileInfo(path, md5);
infos.add(new SingleFileSnapshotInfo(fileInfo, term, index));| // Track the latest snapshot without MD5 as fallback. | ||
| SingleFileSnapshotInfo fallbackWithoutMd5 = null; | ||
| for (SingleFileSnapshotInfo latest : infos) { | ||
| if (!latest.hasMd5()) { | ||
| if (fallbackWithoutMd5 == null) { | ||
| fallbackWithoutMd5 = latest; | ||
| } | ||
| continue; | ||
| } | ||
|
|
||
| final Path path = latest.getFile().getPath(); | ||
| try { | ||
| final MD5Hash md5 = MD5FileUtil.readStoredMd5ForFile(path.toFile()); | ||
| if (md5 == null) { | ||
| LOG.warn("Snapshot file {} has missing MD5 file.", latest); | ||
| continue; | ||
| } | ||
| final FileInfo info = new FileInfo(path, md5); | ||
| return new SingleFileSnapshotInfo(info, latest.getTerm(), latest.getIndex(), true); | ||
| } catch (IOException e) { | ||
| LOG.warn("Failed to read MD5 for snapshot file {}, trying older snapshots.", latest, e); |
There was a problem hiding this comment.
This can be simplified since either all snapshots do not have MD5 or at least one has.
static SingleFileSnapshotInfo findLatestSnapshot(Path dir) throws IOException {
final List<SingleFileSnapshotInfo> infos = getSingleFileSnapshotInfos(dir);
if (infos.isEmpty()) {
return null;
}
infos.sort(Comparator.comparing(SingleFileSnapshotInfo::getIndex).reversed());
for (SingleFileSnapshotInfo latest : infos) {
if (latest.hasMd5()) {
return latest;
}
}
return infos.get(0); // all snapshots do not have MD5
}| } | ||
|
|
||
| final List<SingleFileSnapshotInfo> allSnapshotFiles = getSingleFileSnapshotInfos(stateMachineDir.toPath()); | ||
| // Fetch all the snapshot files irrespective of whether they have an MD5 file or not |
There was a problem hiding this comment.
Move it as javadoc of getSingleFileSnapshotInfos(..).
|
@szetszwo thanks a lot for your continuous inputs. I have addressed the changes in the latest commit |
szetszwo
left a comment
There was a problem hiding this comment.
+1 the change looks good.
What changes were proposed in this pull request?
Snapshot is not an atomic operation and in some cases there might be corrupted snapshots.
Every snapshot has an associated MD5 file with it, if this file is missing we can skip the snapshot.
This PR adds the check for the existence of MD5 file and skips any snapshot which doesn't have it.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/RATIS-244
How was this patch tested?
Patch was tested manually by running the unit tests.