
HDDS-14859. Ignore transient errors while validating RocksDb on volumes. #9947

Draft

ptlrs wants to merge 1 commit into apache:master from
ptlrs:HDDS-14859-Ignore-transient-errors-while-validating-RocksDb-on-volumes


Conversation

@ptlrs (Contributor) commented Mar 18, 2026

What changes were proposed in this pull request?

In the volume scanner, we open the RocksDB instance that is present on each volume.
There can be transient errors when opening this RocksDB in read-only mode.

The volume scanner should therefore retry opening the RocksDB instance a second time after a short gap, to reduce false positives when declaring disk failures.
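The retry-once-after-a-gap idea described above can be sketched as follows. This is an illustrative sketch only, not the actual patch; the class name, method name, and the `Callable` shape are all hypothetical.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/**
 * Illustrative sketch only (not the actual patch): run a volume DB check and,
 * on failure, retry exactly once after a short gap before declaring a failure.
 */
public class DbCheckRetrySketch {

  static <T> T checkWithOneRetry(Callable<T> check, Duration gap) throws Exception {
    try {
      return check.call();
    } catch (Exception firstFailure) {
      // Give a transient error (e.g. a short-lived I/O hiccup) time to clear.
      Thread.sleep(gap.toMillis());
      try {
        return check.call(); // second and final attempt
      } catch (Exception secondFailure) {
        secondFailure.addSuppressed(firstFailure);
        throw secondFailure; // only now treat the volume check as failed
      }
    }
  }
}
```

Keeping the first exception as a suppressed exception preserves the evidence of the transient error without masking the final failure.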

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14859

How was this patch tested?

https://github.com/ptlrs/ozone/actions/runs/23264628727

@errose28 (Contributor) commented:

Thanks for looking into this @ptlrs. I think we may want to use a secondary instance instead of a read-only instance for this check. It looks like it will meet the same goal of reading the CURRENT and MANIFEST files, but will only fail if the DB is truly in bad health. We will need to provide it a directory to write its own log files, but we can use the volume's tmp directory for this, which already holds other temporary files created during the disk check.

@ptlrs (Contributor, Author) commented Mar 18, 2026

Thanks for taking a look @errose28. Using a secondary instance had crossed my mind while writing this PR, but that page was the reason I didn't make the change.

In fact I saw this comment in a different discussion which says:

Read-only instances has only the state of the db when the read-only instance is opened. It's not meant to be used when the db is also opened by a normal DB instance. The normal db instance can cause file deletions unaware to the read-only instance. Future reads issued to the read-only instance may hit IOError due to trying to read from non-existing files.

So based on the documentation and the comment, it appears to me that unless you actually perform reads in RO mode, there won't be a problem.

We don't perform any reads during the volume check, but RocksDB does read the metadata from the footer of all SST files, as well as the MANIFEST files and the WAL, when it is opened in RO mode.

If that is where the problem lies, I am not sure using a secondary instance would solve it.

A secondary instance appears to be the same as an RO instance, with the extra capability of bringing the instance up to speed with the RW instance via a manual catch-up call.
This is nice to have, but useful only if you actually read the data.

I don't mind changing to a secondary instance, as at worst we would get the same behavior, but it would be good to brainstorm what the differences could be.

@errose28 (Contributor) commented:

@ptlrs the main documentation is unfortunately vague here, but there's also this excerpt from the FAQ:

Q: Can I write to RocksDB using multiple processes?

A: No. However, it can be opened using Secondary DB. If no write goes to the database, it can be opened in read-only mode from multiple processes.

Q: Does RocksDB support multi-process read access?

A: Yes, you can read it using secondary database using DB::OpenAsSecondary(). RocksDB can also support multi-process read only process without writing the database. This can be done by opening the database with DB::OpenForReadOnly() call.

In our case we are only using one process with multiple handles. However, since writes will be going to the DB while we are checking the volume, this seems to indicate that a secondary instance is still what we want to use here.
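For reference, the two open modes being compared can be sketched with the RocksJava API. This is a sketch under assumptions: the class and path names are illustrative, and the exact signatures and the note that secondary mode requires `max_open_files = -1` should be verified against the RocksDB version Ozone ships.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

/** Sketch of the two open modes under discussion; names are illustrative. */
public class OpenModesSketch {
  static {
    RocksDB.loadLibrary();
  }

  /**
   * Read-only handle: sees the state of the DB at open time. If a primary
   * handle later deletes files this handle saw, reads through it can hit
   * IOError (the behavior quoted in the discussion above).
   */
  static RocksDB openReadOnly(Options options, String dbPath) throws RocksDBException {
    return RocksDB.openReadOnly(options, dbPath);
  }

  /**
   * Secondary handle: tails the primary's MANIFEST and WAL, and needs its own
   * directory for info logs (e.g. the volume's tmp directory, per the review).
   * The RocksDB docs state secondary mode requires max_open_files = -1.
   */
  static RocksDB openSecondary(Options options, String dbPath, String secondaryLogDir)
      throws RocksDBException {
    options.setMaxOpenFiles(-1);
    return RocksDB.openAsSecondary(options, dbPath, secondaryLogDir);
  }
}
```

A secondary handle can additionally call `tryCatchUpWithPrimary()` to replay the primary's newer updates, which matches the "manual command invocation" mentioned earlier in the thread.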

private Duration diskCheckTimeout = DISK_CHECK_TIMEOUT_DEFAULT;

@Config(key = DISK_CHECK_RETRY_GAP_KEY,
    defaultValue = "1m",


should it be at least 15m?

@ptlrs (Contributor, Author):

No, we don't want a long gap between the two checks, as the goal here is to ignore transient errors while opening RocksDB.

If the gap between two consecutive attempts at opening RocksDB is too long, the failure is no longer a transient error.

Secondly, we can't have a 15-minute gap between the two open attempts, as that would exceed the disk check timeout of 10 minutes.

In case of errors when opening RocksDB, we ideally want the thread to sleep for a very short time, try to open RocksDB a second time, and only if we still fail, mark it as an actual failure to open RocksDB.


Should we also increase the 10-minute timeout then? In case of a 3rd retry we'll spend 2 out of 10 minutes sleeping, plus the time for actual processing...

} catch (Exception e) {
  if (Thread.currentThread().isInterrupted()) {
    throw new InterruptedException("Check of database for volume " + this + " interrupted.");

final int maxAttempts = 2;


should it be configurable?

@ptlrs (Contributor, Author):

My thought process for not making this configurable was to avoid problematic configurations where the total time across attempts could exceed the disk check timeout.

For example, if maxAttempts is 10 and we sleep for 1 minute before each retry, this is no longer about transient errors, and the total time of roughly 10 minutes will make the method flaky against the disk check timeout of 10 minutes.

I can change it if we think we may still want that control in production.
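The constraint being discussed, that the retry schedule must fit inside the disk check timeout, can be sketched as a simple sanity check. The class and method names are hypothetical, and the attempt runtime itself is deliberately ignored, so this is an optimistic lower bound on the time spent.

```java
import java.time.Duration;

/** Illustrative helper, not part of the patch: sanity-check a retry schedule. */
public class RetryBudgetSketch {

  /**
   * Returns true if {@code maxAttempts} attempts, separated by {@code gap}
   * sleeps, leave room inside {@code timeout}. Each attempt's own runtime is
   * ignored here, so real configurations need extra headroom on top of this.
   */
  static boolean fitsWithinTimeout(int maxAttempts, Duration gap, Duration timeout) {
    // (maxAttempts - 1) sleeps occur between consecutive attempts.
    Duration totalSleep = gap.multipliedBy(Math.max(0, maxAttempts - 1));
    return totalSleep.compareTo(timeout) < 0;
  }
}
```

With the values in this PR, 2 attempts and a 1-minute gap comfortably fit the 10-minute timeout, while a hypothetical schedule whose sleeps alone reach 10 minutes would not.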
