HDDS-14800. Guard RocksDB iterator against closed DB during volume failure#9904
HDDS-14800. Guard RocksDB iterator against closed DB during volume failure#9904priyeshkaratha wants to merge 15 commits intoapache:masterfrom
Conversation
|
@priyeshkaratha , could you add a unit test for BackgroundContainerDataScanner? |
22bc2d7 to
765500e
Compare
...ainer-service/src/main/java/org/apache/hadoop/ozone/container/metadata/AbstractRDBStore.java
Show resolved
Hide resolved
...ainer-service/src/main/java/org/apache/hadoop/ozone/container/metadata/AbstractRDBStore.java
Outdated
Show resolved
Hide resolved
...p-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBStoreAbstractIterator.java
Show resolved
Hide resolved
...framework/src/test/java/org/apache/hadoop/hdds/utils/db/TestRDBStoreIteratorWithDBClose.java
Show resolved
Hide resolved
...p-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBStoreAbstractIterator.java
Show resolved
Hide resolved
|
Let's add some test for the race condition scenarios. |
devmadhuu
left a comment
There was a problem hiding this comment.
@priyeshkaratha overall changes LGTM +1. Just few nits. Pls check.
...p-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBStoreAbstractIterator.java
Show resolved
Hide resolved
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RocksDatabase.java
Show resolved
Hide resolved
...p-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBStoreAbstractIterator.java
Outdated
Show resolved
Hide resolved
...p-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBStoreAbstractIterator.java
Show resolved
Hide resolved
...p-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBStoreAbstractIterator.java
Outdated
Show resolved
Hide resolved
|
Thanks @devmadhuu for the review. Addressed all your comments in latest patch. |
devmadhuu
left a comment
There was a problem hiding this comment.
Thanks @priyeshkaratha for improving the patch. Just one minor comment observed. Pls check.
...p-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBStoreAbstractIterator.java
Outdated
Show resolved
Hide resolved
devmadhuu
left a comment
There was a problem hiding this comment.
Thanks @priyeshkaratha. Now patch LGTM +1
Gargi-jais11
left a comment
There was a problem hiding this comment.
Thanks @priyeshkaratha for updating the patch. LGTM.
errose28
left a comment
There was a problem hiding this comment.
Overall LGTM but Cursor has a sharper eye. This bug it found looks legitimate, as does the recommended fix:
While the RDBStoreAbstractIterator holds a lock for its lifecycle, there is a Time-Of-Check to Time-Of-Use (TOCTOU) gap when instantiating it inside RDBTable.
In RDBTable.java, the iterator is constructed like this:
return new RDBStoreByteArrayIterator(db.newIterator(family, false), this, prefix, type);
Here is how the race condition manifests:
- db.newIterator(...) is called. Inside RocksDatabase, this method opens a try-with-resources block, calls acquire(), creates the native RocksDB iterator, and drops the lock upon returning.
- [RACE WINDOW] The counter is now 0. If a concurrent failVolume() triggers waitAndClose(), it succeeds because counter is 0, and the DB is immediately destroyed natively.
- The constructor new RDBStoreByteArrayIterator(...) is now called with the created (but now orphaned) native iterator.
- The constructor attempts table.acquireIterator(), which fails because the DB is closed.
- In the catch block, it calls iterator.close() to avoid a leak. However, because the underlying database was already physically destroyed in Step 2, calling close() on the native iterator triggers a Use-After-Free condition, which will likely crash the JVM.
Fix Recommendation:
Move the acquire() logic out of the iterator's constructor and into RDBTable. RDBTable should acquire the dbRef lock before instantiating the native iterator, and pass the acquired lock into the RDBStoreAbstractIterator constructor.
Additionally it identified unprotected uses in the OM. I don't think these are as critical since OM volume failure should be fatal, but it would be good to file a follow-up Jira to check these.
@ptlrs do you have any comments on this change?
ptlrs
left a comment
There was a problem hiding this comment.
Thanks for the PR @priyeshkaratha.
Iteration can possibly be a long running operation. In case of volume failure, we don't want to keep the RocksDb forced open just because somebody is iterating on that Db. There could be other operations running on the Db such as compaction and running those on a faulty disk can result in the Db entering an inconsistent state.
This requires a cooperation mechanism amongst the clients to check if there is a pending request to close the Db and if so to close the iterator early. A new PR is needed to address this.
Some of my earlier concerns around the refcounting have already been addressed by other comments.
While this change looks mostly good to me, I do believe that the refcounting should happen at the handle level instead of the operation level.
Refcounting at operation level can have subtle bugs and is at best only valid during the lifetime of operation. It may result in partial completion of methods which could result in inconsistent/partial state.
The usage of ReferenceCountedDB over RawDB could possibly be the generalized solution.
...p-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBStoreAbstractIterator.java
Outdated
Show resolved
Hide resolved
...p-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBStoreAbstractIterator.java
Show resolved
Hide resolved
...p-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBStoreAbstractIterator.java
Outdated
Show resolved
Hide resolved
...p-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBStoreAbstractIterator.java
Outdated
Show resolved
Hide resolved
...p-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBStoreAbstractIterator.java
Outdated
Show resolved
Hide resolved
sumitagrawl
left a comment
There was a problem hiding this comment.
@priyeshkaratha This will fix only TypedTable and race condition as mentioned by Ethan, we can add fix to ManagedRocksDBIterator and this should be use for every access,
- Check if isClosed() and isValid()
- Also implicitly managed by leak detection if reference acquired is not released.
Like below, so that when ever its accessed. There will be changes to all access and some code impact.
public class ManagedRocksIterator extends ManagedObject<RocksIterator> {
AutoCloseable acquire;
public ManagedRocksIterator(RocksIterator original, IRoskDBClosable dbClosable) {
super(original);
acquire = dbClosable.acquire();
}
@Override
public RocksIterator get() throws RocksDBException {
if (dbClosable.isClosed()) {
throw new RocksDBException("Rocksdb is closed");
}
if (super.get().isValid()) {
throw new RocksDBException("iterator is not valid");
}
return super.get();
}
@Override
public void close() {
super.close();
acquire.close();
}
public static ManagedRocksIterator managed(RocksIterator iterator, IRoskDBClosable dbClosable) {
return new ManagedRocksIterator(iterator, dbClosable);
}
}
|
Thanks @errose28 @sumitagrawl @ptlrs for the reviews. @errose28 @sumitagrawl @ptlrs I have addressed your comments in individual sections. |
ptlrs
left a comment
There was a problem hiding this comment.
LGTM. Thanks for the changes @priyeshkaratha.
errose28
left a comment
There was a problem hiding this comment.
Thanks for the updates @priyeshkaratha LGTM.
...ainer-service/src/main/java/org/apache/hadoop/ozone/container/metadata/AbstractRDBStore.java
Outdated
Show resolved
Hide resolved
...p-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBStoreAbstractIterator.java
Show resolved
Hide resolved
|
Thanks @ChenSammi for the review. I have updated the patch. |
ChenSammi
left a comment
There was a problem hiding this comment.
Thanks @priyeshkaratha , the last patch LGTM.
What changes were proposed in this pull request?
When
StorageVolumeCheckerdetects a volume failure and callsfailVolume(), it closes the underlying RocksDB instance whileBackgroundContainerDataScannerorOnDemandContainerScannermay still hold an active iterator over that DB, calling native RocksDB methods on a closed DB can cause a crash. This fix adds two complementary guards toRDBStoreAbstractIterator: a fast-fail check sohasNext()returns false immediately once the DB is closed (stopping the scan loop without touching native code), and reference counting by acquiring a slot on RocksDatabase.counter at iterator creation and releasing it on close(), so the existingwaitAndClose()mechanism waits for all iterators to finish before physically closing the DB. Together these ensure the scan exits cleanly and the DB cannot be destroyed while an iterator is still in use.What is the link to the Apache JIRA
HDDS-14800
How was this patch tested?
Tested using added unit test cases.