HDDS-14768. Fix lock leak during snapshot cache cleanup and handle eviction race appropriately. by SaketaChalamchala · Pull Request #9869 · apache/ozone

SaketaChalamchala · 2026-03-05T21:23:11Z

What changes were proposed in this pull request?

Currently,

Eviction race: SnapshotCache cleanup throws an IllegalStateException when it finds stale entries in pendingEvictionQueue for snapshots that have already been removed from dbMap
Ex., say SnapshotPurge invalidates the entry right before the last thread with a reference to the snapshot just closes adding the snapshotID back to the evictionQueue
Inconsistent Bookkeeping: invalidate removes snapshot entry from dbMap but does not remove it from pendingEvictionQueue if it exists.
Potential snapshot leak: Snapshot close failure during cleanup removes the snapshotID from eviction queue and throws an exception. This causes the snapshot to remain in cache even is refCount = 0 and the snapshot entry remains in dbMap unless
some other thread explicitly invalidates it or references it again. This means SnapshotCache.lock() during this time cannot hold the write lock because lock() expects the cache to be drained.
Write lock leak: Fix write lock leak in SnapshotCache. If the cache drain cleanup(true) throws an exception write lock is not released in SnapshotCache.lock()

Proposed solution:

Handle eviction race appropriately. Log the stale snapshot entry in eviction queue and remove it from the queue.
Remove snapshot entry from eviction queu upon successful invalidation.
Log snapshot close failure during cleanup but do not remove it from eviction queue so that it's cleanup can be retried later.
Catch any unchecked exception during cleanup and release the write lock in SnapshotCache.lock()

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14768

(Please replace this section with the link to the Apache JIRA)

How was this patch tested?

Unit tests.

…iction race.

Copilot

Pull request overview

Fixes SnapshotCache eviction/cleanup edge cases that could previously throw during cleanup, leak the snapshot DB write lock, or leave stale eviction entries behind—improving correctness and reliability of snapshot purge / checkpoint coordination in Ozone Manager.

Changes:

Remove snapshot IDs from the pending eviction queue on invalidate, and tolerate stale eviction entries during cleanup.
Ensure SnapshotCache write lock is released if cleanup throws (including unchecked exceptions).
Adjust cleanup behavior so snapshot close failures are logged and retried later, with added unit tests covering these races/failures.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File	Description
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java	Updates eviction bookkeeping, cleanup behavior on stale/missing entries and close failures, and hardens lock() to release locks on exceptions.
hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotCache.java	Adds unit tests for stale-eviction cleanup, close-failure retry behavior, and write-lock release when cleanup throws.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

...ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotCache.java

Copilot · 2026-03-09T03:05:22Z

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java

+        LOG.warn("SnapshotId '" + k + "' does not exist in cache. The RocksDB " +
            "instance of the Snapshot may not be closed properly.");


The new warning for stale eviction keys (v == null) says the RocksDB instance “may not be closed properly”, but this condition can also occur in the expected invalidate + late close race (snapshot was closed and removed from dbMap, and the callback re-queued the UUID). Consider rewording the log to reflect that this can be a benign stale-queue entry, and include guidance only if it indicates a real leak signal.

Suggested change

LOG.warn("SnapshotId '" + k + "' does not exist in cache. The RocksDB " +

"instance of the Snapshot may not be closed properly.");

LOG.warn("SnapshotId '{}' not found in cache during cleanup. "

+ "This can happen if the snapshot was already closed and "

+ "removed from the cache, leaving a stale entry in the eviction "

+ "queue. If this message appears frequently for the same "

+ "snapshot or the cache size keeps growing, it may indicate that "

+ "a RocksDB instance was not closed properly.");

Copilot · 2026-03-09T03:05:22Z

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java

+      } else if (v.getTotalRefCount() > 0) {
        LOG.debug("SnapshotId {} is still being referenced ({}), skipping its clean up.", k, v.getTotalRefCount());
-        return v;
+        result = v;
      } else {
        LOG.debug("Closing SnapshotId {}. It is not being referenced anymore.", k);
        // Close the instance, which also closes its DB handle.
        try {
          v.get().close();
        } catch (IOException ex) {
-          throw new IllegalStateException("Error while closing snapshot DB.", ex);
+          LOG.error("Error while closing snapshot DB.", ex);
+          return v;
        }
        omMetrics.decNumSnapshotCacheSize();
-        return null;
      }
+      pendingEvictionQueue.remove(k);
+      return result;
    });


In cleanup(evictionKey, ...), the pending eviction key is removed even when the snapshot is still referenced (totalRefCount > 0). If another thread decrements the refcount to 0 and the callback re-adds the key concurrently, this removal can win and drop the key permanently, leaving an unreferenced snapshot in dbMap that will never be retried for cleanup. Consider either not removing the key when refcount > 0, or removing it conditionally and re-adding if the refcount becomes 0 after the decision (or otherwise synchronizing queue updates).

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java

Copilot · 2026-03-09T03:05:22Z

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java

  private UncheckedAutoCloseableSupplier<OMLockDetails> lock(Supplier<OMLockDetails> lockFunction,
      Supplier<OMLockDetails> unlockFunction, Supplier<Boolean> cleanupFunction) {
    Supplier<OMLockDetails> emptyLockFunction = () -> getEmptyOmLockDetails(lockFunction.get());
    Supplier<OMLockDetails> emptyUnlockFunction = () -> getEmptyOmLockDetails(unlockFunction.get());

    AtomicReference<OMLockDetails> lockDetails = new AtomicReference<>(emptyLockFunction.get());
    if (lockDetails.get().isLockAcquired()) {
-      if (!cleanupFunction.get()) {
+      try {
+        if (!cleanupFunction.get()) {
+          lockDetails.set(emptyUnlockFunction.get());
+        }
+      } catch (Throwable t) {
        lockDetails.set(emptyUnlockFunction.get());
+        throw t;
      }


SnapshotCache.lock(...) can return a supplier whose OMLockDetails reports lock not acquired when cleanupFunction returns false (eg, cache not drained). The public lock() / lock(UUID) Javadocs currently describe the lock as ensuring thread-safety, but don’t mention that callers must check isLockAcquired() (some call sites use try-with-resources without checking). Consider either documenting this contract explicitly here or throwing when cleanup can’t satisfy the precondition so callers can’t proceed without the intended lock.

Updated the patch to throw an exception when cleanup returns false so that callers do not proceed without the lock.

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java

smengcl · 2026-03-09T05:09:38Z

...ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotCache.java

+      Field f = SnapshotCache.class.getDeclaredField("pendingEvictionQueue");
+      f.setAccessible(true);
+      return (Set<UUID>) f.get(cache);


Using reflection to access private field in SnapshotCache is unstable. Why not just expose it via a getPendingEvictionQueue() method in SnapshotCache that has @VisibleForTesting annotation?

jojochuang · 2026-03-06T22:12:59Z

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java

   * @param key SnapshotId
   */
  public void invalidate(UUID key) {
    dbMap.compute(key, (k, v) -> {


SnapshotCache.invalidate() is called during OMSnapshotPurgeResponse.

jojochuang · 2026-03-06T22:55:17Z

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java

        }
        omMetrics.decNumSnapshotCacheSize();
      }
+      pendingEvictionQueue.remove(k);


pendingEvictionQueue may have the entry k, if snapshot purge response is happening after, and all references of the snapshot is decremented (releasing the SnapshotCache lock of the key), but before the periodic cleanup thread kicks in.

note: this case should not happen during checkpointing, because checkpoints holds a write lock of the snapshot cache, and once released, it invokes cleanup() immediately.

jojochuang · 2026-03-06T23:19:29Z

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java


    AtomicReference<OMLockDetails> lockDetails = new AtomicReference<>(emptyLockFunction.get());
    if (lockDetails.get().isLockAcquired()) {
-      if (!cleanupFunction.get()) {


the lock is released by calling emptyUnlockFunction.get(), if cleanup operation was not successful or if it throws a Throwable.

SaketaChalamchala · 2026-03-10T22:43:09Z

Thanks for the review @sadanand48 , @smengcl and @jojochuang I addressed your comments and relevant copilot suggestions as well

Additional updates:

reverted the error log to an exception in cleanup so that it fails fast.
Throwing an exception when cleanup function returns false so that callers that do not check for isLockAcquired() do not proceed without the lock.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java

Copilot · 2026-03-12T15:36:33Z

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java

+        if (!cleanupFunction.get()) {
+          lockDetails.set(emptyUnlockFunction.get());
+          throw new IllegalStateException("Failed to acquire lock as cleanup did not drain the cache.");
+        }


lock(...) now throws an IllegalStateException when cleanup does not fully drain the cache. Since this is a public API and some callers (e.g., checkpoint streaming) may not be expecting a runtime exception here, it would be safer to either (a) document this behavior explicitly in the lock()/lock(UUID) javadocs, or (b) preserve the previous contract by returning non-acquired lock details instead of throwing.

Copilot · 2026-03-12T15:36:33Z

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java

+        LOG.warn("SnapshotId '{}' does not exist in cache. The RocksDB " +
+            "instance of the Snapshot may not be closed properly.", k);


cleanup() now treats a missing cache entry as non-fatal (to handle stale eviction keys), but still logs it at WARN with a message implying a RocksDB resource leak. Since this case is now expected/benign (see the new stale-eviction-key test), consider lowering this to DEBUG/INFO and/or updating the message to reflect that the key may have been invalidated and already closed.

Suggested change

LOG.warn("SnapshotId '{}' does not exist in cache. The RocksDB " +

"instance of the Snapshot may not be closed properly.", k);

LOG.debug("SnapshotId '{}' is not present in cache during cleanup; "

+ "it may have already been invalidated, closed, and removed.", k);

jojochuang · 2026-03-18T07:47:08Z

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java

@@ -377,26 +388,25 @@ private synchronized Void cleanup(UUID evictionKey, boolean expectKeyToBePresent
    }


this means the entry in the SnapshotCache is removed before cleanup() is called.

it should not happen during Snapshot Purge Response, because it explicitly invalidate snapshot cache.

jojochuang · 2026-03-18T07:47:10Z

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java

-      }
-
-      if (v.getTotalRefCount() > 0) {
+        LOG.warn("SnapshotId '{}' does not exist in cache. The RocksDB " +


this scenario is possible if SnapshotDeletingService closes the reference to the snapshot after the snapshot is deleted, which is entirely possible.

jojochuang · 2026-03-18T07:51:58Z

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java

+        LOG.warn("SnapshotId '{}' does not exist in cache. The RocksDB " +
+            "instance of the Snapshot may not be closed properly.", k);


i'd suggest to make this warning message even more benign, as suggested by Copilot. We do not want to spook users.

In fact, since this is expected. We could reduce the log level to INFO.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java

jojochuang · 2026-03-18T16:39:19Z

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java

+          throw new IllegalStateException("Failed to acquire lock as cleanup did not drain the cache.");
+        }
+      } catch (Throwable t) {
        lockDetails.set(emptyUnlockFunction.get());


...ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotCache.java

+  @DisplayName("Close failure keeps snapshot in eviction queue for retry")
+  void testCloseFailureRetriesSnapshot() throws Exception {
+
+    snapshotCache = new SnapshotCache(cacheLoader, CACHE_SIZE_LIMIT, omMetrics, 0, true, newAcquiringLock());


jojochuang · 2026-03-18T16:39:11Z

...ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotCache.java

+    assertTrue(snapshotCache.getPendingEvictionQueue().contains(snapshotId));
+
+    // First cleanup attempt fails to close; entry should remain in dbMap and key should stay queued for retry.
+    assertThrows(IllegalStateException.class, () -> snapshotCache.lock());


jojochuang · 2026-03-18T16:52:22Z

lgtm. I'll commit with the few suggestions Copilot added.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Change-Id: I8a8dc358745d0dbcce0623b8cda37718de9e76ac

Change-Id: Ibb1c44266a13dc1f10d222b1dbc4a5fbd27a19c0

jojochuang · 2026-03-19T04:47:38Z

Merged. Thanks @SaketaChalamchala

SaketaChalamchala added 2 commits March 5, 2026 13:06

HDDS-14768. Fix lock leak during snapshot cache cleanup and handle ev…

89f51cc

…iction race.

HDDS-14768. Added unit tests.

abb4660

jojochuang requested review from aswinshakil, sadanand48 and smengcl March 6, 2026 20:50

jojochuang marked this pull request as ready for review March 7, 2026 00:04

smengcl requested review from Copilot and removed request for aswinshakil, sadanand48 and smengcl March 9, 2026 02:59

Copilot started reviewing on behalf of smengcl March 9, 2026 02:59 View session

Copilot AI reviewed Mar 9, 2026

View reviewed changes

smengcl reviewed Mar 9, 2026

View reviewed changes

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java Outdated Show resolved Hide resolved

smengcl reviewed Mar 9, 2026

View reviewed changes

jojochuang added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Mar 9, 2026

jojochuang reviewed Mar 9, 2026

View reviewed changes

HDDS-14768. Addressed comments.

aa5528d

jojochuang requested review from Copilot, jojochuang, sadanand48 and smengcl March 12, 2026 15:26

Copilot started reviewing on behalf of jojochuang March 12, 2026 15:29 View session

Copilot AI reviewed Mar 12, 2026

View reviewed changes

HDDS-14768. Addressed copilot comments.

e90a965

jojochuang reviewed Mar 18, 2026

View reviewed changes

HDDS-14768. Reduced bening logger to info.

0d130af

SaketaChalamchala requested review from Copilot and jojochuang March 18, 2026 14:48

Copilot AI reviewed Mar 18, 2026

View reviewed changes

Copilot started reviewing on behalf of SaketaChalamchala March 18, 2026 15:15 View session

jojochuang and others added 3 commits March 18, 2026 09:52

Potential fix for pull request finding

c1678c0

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

copilot suggested updates

0d4ebb9

Change-Id: I8a8dc358745d0dbcce0623b8cda37718de9e76ac

Remove extra line

c84828d

Change-Id: Ibb1c44266a13dc1f10d222b1dbc4a5fbd27a19c0

jojochuang merged commit 243be86 into apache:master Mar 19, 2026
45 checks passed

		LOG.warn("SnapshotId '" + k + "' does not exist in cache. The RocksDB " +
		"instance of the Snapshot may not be closed properly.");

-        LOG.warn("SnapshotId '" + k + "' does not exist in cache. The RocksDB " +
-            "instance of the Snapshot may not be closed properly.");
+        LOG.warn("SnapshotId '{}' not found in cache during cleanup. "
+            + "This can happen if the snapshot was already closed and "
+            + "removed from the cache, leaving a stale entry in the eviction "
+            + "queue. If this message appears frequently for the same "
+            + "snapshot or the cache size keeps growing, it may indicate that "
+            + "a RocksDB instance was not closed properly.");

		LOG.warn("SnapshotId '{}' does not exist in cache. The RocksDB " +
		"instance of the Snapshot may not be closed properly.", k);

		@@ -377,26 +388,25 @@ private synchronized Void cleanup(UUID evictionKey, boolean expectKeyToBePresent
		}

Conversation

SaketaChalamchala commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SaketaChalamchala commented Mar 10, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jojochuang commented Mar 18, 2026

Uh oh!

Uh oh!

jojochuang commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

SaketaChalamchala commented Mar 5, 2026 •

edited

Loading