HDDS-14859. Ignore transient errors while validating RocksDb on volumes.#9947
Conversation
|
Thanks for looking into this @ptlrs. I think we may want to use a secondary instance instead of a read-only instance for this check. It looks like it will meet the same goal of reading CURRENT and MANIFEST files, but will only fail if the DB is truly in bad health. We will need to provide it a directory to write its own log files, but we can use the volume's specific |
|
Thanks for taking a look @errose28. Using secondary instance had crossed my mind while writing this PR but that page was the reason I didn't make the change. In fact I saw this comment in a different discussion which says:
So based on the documentation and the comment, to me it appears that unless you actually perform any reads in RO mode, there won't be a problem. We don't perform any reads during the volume check but RocksDb does read the metadata from the footer of all SST files as well the MANIFEST files and the WAL when it is opened in RO mode. If that is where the problem is then, not sure if using a secondary instance would solve our problem. Secondary instance appears to be the same as RO instance with the extra capability of bringing the instance upto speed with the RW instance using a manual command invocation. I don't mind changing to secondary instance, as at worst we would get the same behavior but it would be good to brainstorm what could be the differences. |
|
@ptlrs the main documentation is unfortunately vague here, but there's also this excerpt from the FAQ:
In our case we are only using one process with multiple handles. However, since writes will be going to the DB as we are checking the volume, this seems to indicate that secondary instance is still what we want to use here. |
| private Duration diskCheckTimeout = DISK_CHECK_TIMEOUT_DEFAULT; | ||
|
|
||
| @Config(key = DISK_CHECK_RETRY_GAP_KEY, | ||
| defaultValue = "1m", |
There was a problem hiding this comment.
No, we don't want a long time between two checks as the goal here is to ignore transient errors while opening RocksDb.
If the gap between two consecutive attempts at opening RocksDb is too long, it is no longer about transient errors.
Secondly, we can't have the time period between opening the RocksDb twice be 15 minutes as it will fail the disk check timeout of 10 minutes.
In case of errors when opening the RocksDb, we ideally want the thread to sleep for a very short time, try to open the RocksDb a second time and if we still fail, mark it as an actual failure to open RocksDb.
There was a problem hiding this comment.
should we also increase 10 mins timeout then? in case of 3rd retry we'll spend 2 out of 10 mins in sleep + time for actual processing...
| } catch (Exception e) { | ||
| if (Thread.currentThread().isInterrupted()) { | ||
| throw new InterruptedException("Check of database for volume " + this + " interrupted."); | ||
| final int maxAttempts = 2; |
There was a problem hiding this comment.
My thought process for not making this configurable was to avoid problematic configurations where the total attempts could exceed the disk check timeout.
For example if maxAttempts is 10 and you sleep for 1 minute in each attempt, firstly this is no longer about transient errors and the total time of 10 minutes will make the method flaky and fail the disk check timeout of 10 minutes.
I can change it if we think we may still want that control in production.
What changes were proposed in this pull request?
In the volume scanner, we open the RocksDb that is present on each volume.
There could be transient errors when opening this RocksDb in readonly mode.
The volume scanner should therefore try to open the RocksDb instance, a second time after a gap, to reduce any false positives when declaring disk failures.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14859
How was this patch tested?
https://github.com/ptlrs/ozone/actions/runs/23264628727