Skip to content

HDDS-14800. Guard RocksDB iterator against closed DB during volume failure#9904

Draft
priyeshkaratha wants to merge 15 commits intoapache:masterfrom
priyeshkaratha:HDDS-14800
Draft

HDDS-14800. Guard RocksDB iterator against closed DB during volume failure#9904
priyeshkaratha wants to merge 15 commits intoapache:masterfrom
priyeshkaratha:HDDS-14800

Conversation

@priyeshkaratha
Copy link
Contributor

What changes were proposed in this pull request?

When StorageVolumeChecker detects a volume failure and calls failVolume(), it closes the underlying RocksDB instance while BackgroundContainerDataScanner or OnDemandContainerScanner may still hold an active iterator over that DB, calling native RocksDB methods on a closed DB can cause a crash. This fix adds two complementary guards to RDBStoreAbstractIterator: a fast-fail check so hasNext() returns false immediately once the DB is closed (stopping the scan loop without touching native code), and reference counting by acquiring a slot on RocksDatabase.counter at iterator creation and releasing it on close(), so the existing waitAndClose() mechanism waits for all iterators to finish before physically closing the DB. Together these ensure the scan exits cleanly and the DB cannot be destroyed while an iterator is still in use.

What is the link to the Apache JIRA

HDDS-14800

How was this patch tested?

Tested using added unit test cases.

@ChenSammi
Copy link
Contributor

@priyeshkaratha , could you add a unit test for BackgroundContainerDataScanner?

@priyeshkaratha priyeshkaratha marked this pull request as ready for review March 11, 2026 10:39
@Gargi-jais11
Copy link
Contributor

Gargi-jais11 commented Mar 13, 2026

Let's add some test for the race condition scenarios.

Copy link
Contributor

@devmadhuu devmadhuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@priyeshkaratha overall changes LGTM +1. Just few nits. Pls check.

@priyeshkaratha
Copy link
Contributor Author

Thanks @devmadhuu for the review. Addressed all your comments in latest patch.

Copy link
Contributor

@devmadhuu devmadhuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @priyeshkaratha for improving the patch. Just one minor comment observed. Pls check.

Copy link
Contributor

@devmadhuu devmadhuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @priyeshkaratha. Now patch LGTM +1

Copy link

@yandrey321 yandrey321 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@Gargi-jais11 Gargi-jais11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @priyeshkaratha for updating the patch. LGTM.

Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM but Cursor has a sharper eye. This bug it found looks legitimate, as does the recommended fix:


While the RDBStoreAbstractIterator holds a lock for its lifecycle, there is a Time-Of-Check to Time-Of-Use (TOCTOU) gap when instantiating it inside RDBTable.
In RDBTable.java, the iterator is constructed like this:

return new RDBStoreByteArrayIterator(db.newIterator(family, false), this, prefix, type);

Here is how the race condition manifests:

  1. db.newIterator(...) is called. Inside RocksDatabase, this method opens a try-with-resources block, calls acquire(), creates the native RocksDB iterator, and drops the lock upon returning.
  2. [RACE WINDOW] The counter is now 0. If a concurrent failVolume() triggers waitAndClose(), it succeeds because counter is 0, and the DB is immediately destroyed natively.
  3. The constructor new RDBStoreByteArrayIterator(...) is now called with the created (but now orphaned) native iterator.
  4. The constructor attempts table.acquireIterator(), which fails because the DB is closed.
  5. In the catch block, it calls iterator.close() to avoid a leak. However, because the underlying database was already physically destroyed in Step 2, calling close() on the native iterator triggers a Use-After-Free condition, which will likely crash the JVM.

Fix Recommendation:
Move the acquire() logic out of the iterator's constructor and into RDBTable. RDBTable should acquire the dbRef lock before instantiating the native iterator, and pass the acquired lock into the RDBStoreAbstractIterator constructor.


Additionally it identified unprotected uses in the OM. I don't think these are as critical since OM volume failure should be fatal, but it would be good to file a follow-up Jira to check these.

@ptlrs do you have any comments on this change?

Copy link
Contributor

@ptlrs ptlrs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @priyeshkaratha.

Iteration can possibly be a long running operation. In case of volume failure, we don't want to keep the RocksDb forced open just because somebody is iterating on that Db. There could be other operations running on the Db such as compaction and running those on a faulty disk can result in the Db entering an inconsistent state.
This requires a cooperation mechanism amongst the clients to check if there is a pending request to close the Db and if so to close the iterator early. A new PR is needed to address this.

Some of my earlier concerns around the refcounting have already been addressed by other comments.

While this change looks mostly good to me, I do believe that the refcounting should happen at the handle level instead of the operation level.

Refcounting at operation level can have subtle bugs and is at best only valid during the lifetime of operation. It may result in partial completion of methods which could result in inconsistent/partial state.

The usage of ReferenceCountedDB over RawDB could possibly be the generalized solution.

Copy link
Contributor

@sumitagrawl sumitagrawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@priyeshkaratha This will fix only TypedTable and race condition as mentioned by Ethan, we can add fix to ManagedRocksDBIterator and this should be use for every access,

  1. Check if isClosed() and isValid()
  2. Also implicitly managed by leak detection if reference acquired is not released.

Like below, so that when ever its accessed. There will be changes to all access and some code impact.

public class ManagedRocksIterator extends ManagedObject<RocksIterator> {
  AutoCloseable acquire;

  public ManagedRocksIterator(RocksIterator original, IRoskDBClosable dbClosable) {
    super(original);
    acquire =  dbClosable.acquire();
  }

  @Override
  public RocksIterator get() throws RocksDBException {
    if (dbClosable.isClosed()) {
      throw new RocksDBException("Rocksdb is closed");
    }
    if (super.get().isValid()) {
       throw new RocksDBException("iterator is not valid");
    }
    return super.get();
  }

  @Override
  public void close() {
    super.close();
    acquire.close();
  }

  public static ManagedRocksIterator managed(RocksIterator iterator, IRoskDBClosable dbClosable) {
    return new ManagedRocksIterator(iterator, dbClosable);
  }
}

@priyeshkaratha
Copy link
Contributor Author

Thanks @errose28 @sumitagrawl @ptlrs for the reviews.

@errose28
I have refactored the code to move the acquire() logic out of the RDBStoreAbstractIterator. The reference is now safely acquired before the native iterator is passed in, preventing the Use-After-Free crash. I will also file a follow-up Jira to audit the unprotected uses in OM as you suggested - https://issues.apache.org/jira/browse/HDDS-14856

@sumitagrawl
I completely agree that managing this at the ManagedRocksIterator level is a more robust and generalized approach. The recent changes align with this by moving the reference counting (dbRef) out of the RDBStoreAbstractIterator and handling it at the handle/managed level, ensuring it applies to all accesses and not just TypedTable

@ptlrs I have addressed your comments in individual sections.

Copy link
Contributor

@ptlrs ptlrs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the changes @priyeshkaratha.

Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @priyeshkaratha LGTM.

@priyeshkaratha
Copy link
Contributor Author

Thanks @ChenSammi for the review. I have updated the patch.

Copy link
Contributor

@ChenSammi ChenSammi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @priyeshkaratha , the last patch LGTM.

@adoroszlai adoroszlai marked this pull request as draft March 19, 2026 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants