
fix(chainbase): defensive checks and resource leak fixes#6633

Closed
barbatos2011 wants to merge 3 commits into tronprotocol:develop from barbatos2011:fix/defensive-code-safety

Conversation

@barbatos2011

What

Replace assert statements with unconditional runtime checks on consensus-critical paths, fix division-by-zero risks, improve thread safety in Chainbase, and fix resource leaks across the entire database close() chain.

Why

Assert statements are disabled by default in production JVM (no -ea flag). On consensus-critical paths, this creates three failure modes depending on JVM configuration:

  • Without -ea (production default): assertions are silently skipped and code continues with invalid state (e.g., totalEnergyWeight = 0 causes (double) limit / 0L to evaluate to Infinity, and (long) Infinity saturates to Long.MAX_VALUE, giving an account effectively unlimited energy)
  • With -ea: AssertionError (extends Error, not Exception) is thrown, bypassing all catch (Exception) handlers in the transaction processing chain, crashing the block processing thread
  • Mixed network: nodes with different -ea flags produce different execution results for the same block → consensus split
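
The first failure mode can be reproduced in isolation. A minimal sketch of the saturation behavior, with illustrative names standing in for the actual tron-core code:

```java
class SaturationDemo {

  // With -ea absent, an `assert totalEnergyWeight > 0;` before this division
  // is skipped entirely, so it runs with a zero divisor. Double division by
  // zero yields Infinity, and the narrowing cast to long saturates.
  static long effectiveLimit(long limit, long totalEnergyWeight) {
    double perWeight = (double) limit / totalEnergyWeight; // Infinity when weight == 0
    return (long) perWeight; // (long) Double.POSITIVE_INFINITY == Long.MAX_VALUE
  }

  public static void main(String[] args) {
    System.out.println(effectiveLimit(1_000L, 0L)); // prints 9223372036854775807
  }
}
```

The saturation is specified by the Java Language Specification's narrowing primitive conversion rules, so this behavior is identical on every conforming JVM.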

Replacing with if-throw using RuntimeException subclasses ensures: (1) the check always executes regardless of JVM flags, (2) the exception is caught by existing error handling, failing the transaction without crashing the node, (3) these conditions cannot be triggered by valid transactions on a correctly functioning network — they guard against data corruption or independent bugs in other subsystems.
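
The replacement pattern can be sketched as follows (illustrative names, not the exact tron-core code): the check executes unconditionally, and the thrown RuntimeException subclass is caught by the existing catch (Exception) handlers in the transaction pipeline.

```java
class DefensiveCheck {

  static long calculateGlobalEnergyLimit(long limit, long totalEnergyWeight) {
    // before: assert totalEnergyWeight > 0;  (skipped without -ea)
    if (totalEnergyWeight <= 0) {
      throw new IllegalStateException(
          "totalEnergyWeight must be positive, got: " + totalEnergyWeight);
    }
    return (long) ((double) limit / totalEnergyWeight);
  }
}
```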

Resource leaks in close() methods follow the same root cause across 4 classes: writeOptions.close() and dbSource.closeDB() in a single try block, where the first failure prevents the second from executing. The SnapshotManager.checkV2() leak is a separate issue: close() is not called at all when recover() throws.
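
The fix for the first root cause can be sketched with a hypothetical helper (the real classes close a WriteOptions and a DB source; any two AutoCloseables show the shape):

```java
class CloseChainSketch {

  // Each close() gets its own try-catch, so a failure in the first close
  // cannot prevent the second from executing. Putting both in one try block
  // is the leak pattern this PR fixes.
  static void closeBoth(AutoCloseable writeOptions, AutoCloseable dbSource) {
    try {
      writeOptions.close();
    } catch (Exception e) {
      // log and continue (illustrative; real code logs the failure)
    }
    try {
      dbSource.close();
    } catch (Exception e) {
      // log and continue
    }
  }
}
```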

Changes

Commit 1: Assert → if-throw replacement

Replace 5 assert statements with explicit if-throw checks:

| Location | Condition | Exception |
| --- | --- | --- |
| RepositoryImpl.increase() | now < lastTime (time backwards) | IllegalArgumentException |
| RepositoryImpl.calculateGlobalEnergyLimit() | totalEnergyWeight <= 0 | IllegalStateException |
| EnergyProcessor.calculateGlobalEnergyLimit() | same, in else branch symmetric with allowNewReward() early return | IllegalStateException |
| ResourceProcessor.increase() | now < lastTime (same as RepositoryImpl) | IllegalArgumentException |
| MerklePath.encode() | authenticationPath.size() != index.size() | ZksnarkException |

Add MerklePathEncodeTest and DefensiveChecksTest covering all 5 exception paths.

Commit 2: Defensive numeric and structural fixes

  • TransactionTrace: add if (newSize == 0) guard before newArea / newSize division in resetAccountUsage() and resetAccountUsageV2() to prevent ArithmeticException. The zero condition can occur when the caller account has no prior energy consumption history (raw window size = 0)
  • ResilienceService: replace all new Random() with ThreadLocalRandom.current() to avoid per-call object allocation in a scheduled task; remove Random field from WeightedRandom inner class; add totalWeight <= 0 guard before nextInt(). SecureRandom is unnecessary here — random peer disconnection only needs statistical uniformity, not cryptographic unpredictability
  • DbConvert: merge two parallel List<byte[]> (keys/values) into a single List<Map.Entry<byte[], byte[]>>, eliminating index-mismatch risk at the type level. batchInsert() is a private method only called within convertLevelToRocks()

Add TransactionTraceTest cases for newSize == 0 on both V1 and V2 paths.
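
The division guard from the first bullet can be sketched as (illustrative method, not the actual TransactionTrace code):

```java
class UsageResetSketch {

  // Guard before the average computation: a zero window size means the
  // account has no prior energy consumption history, so there is nothing
  // to average and the division would throw ArithmeticException.
  static long averageUsage(long newArea, long newSize) {
    if (newSize == 0) {
      return 0;
    }
    return newArea / newSize;
  }
}
```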

Commit 3: Thread safety and resource leak fixes

  • Chainbase.head: declare as volatile for cross-thread visibility. Read methods (getUnchecked, prefixQuery, etc.) are not synchronized, so without volatile reader threads may see stale references under JMM
  • Chainbase.prefixQuery(): snapshot head into a local variable at method entry, pass it to both prefixQueryRoot() and prefixQuerySnapshot(). Previously, these two sub-methods read head independently — if setHead() is called between the two reads, the results are from different snapshots. volatile alone does not fix this; the local snapshot guarantees single-call consistency. Other read methods (get, getUnchecked, has) only read head once via head(), so volatile alone is sufficient for them
  • TronDatabase.close(): split writeOptions.close() and dbSource.closeDB() into separate try-catch blocks so one failure does not skip the other
  • LevelDB.close() / RocksDB.close(): same split try-catch pattern (discovered by tracing the full close() chain — these had the identical bug)
  • CheckPointV2Store.close(): close own writeOptions in try-catch, then call super.close() instead of dbSource.closeDB() to ensure the full parent cleanup chain executes. Note: subclass and parent each declare their own writeOptions field (Java fields are not polymorphic), both must be closed separately
  • SnapshotManager.checkV2(): wrap getCheckpointDB() in try-with-resources so close() runs even if recover() throws

Add ResourceCloseTest (mock writeOptions.close() to throw, verify dbSource.closeDB() is still called) and ChainbaseConcurrentTest (4 threads, 500 iterations of concurrent setHead + prefixQuery).
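
The head-snapshot pattern from the second bullet can be sketched with plain maps standing in for Snapshot (types and method names are illustrative, not the actual Chainbase API):

```java
import java.util.HashMap;
import java.util.Map;

class PrefixQuerySketch {

  // volatile guarantees reader threads see the latest head, but two separate
  // reads of the field could still observe different values if setHead()
  // runs in between -- hence the single read into a local below.
  private volatile Map<String, String> head = new HashMap<>();

  void setHead(Map<String, String> newHead) {
    head = newHead;
  }

  Map<String, String> prefixQuery(String prefix) {
    Map<String, String> localHead = head; // read once; both sub-queries share it
    Map<String, String> result = queryRoot(localHead, prefix);
    result.putAll(querySnapshot(localHead, prefix));
    return result;
  }

  private Map<String, String> queryRoot(Map<String, String> snapshot, String prefix) {
    Map<String, String> out = new HashMap<>();
    snapshot.forEach((k, v) -> {
      if (k.startsWith(prefix)) {
        out.put(k, v);
      }
    });
    return out;
  }

  private Map<String, String> querySnapshot(Map<String, String> snapshot, String prefix) {
    // same snapshot object as queryRoot, so the two results are consistent
    return queryRoot(snapshot, prefix);
  }
}
```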

Scope

  • Does NOT change consensus rules or transaction validation logic
  • The assert replacements guard conditions that cannot be triggered by valid transactions on a correctly functioning mainnet — no hard fork required
  • Does NOT modify any public API or protobuf definitions

Test

  • All existing framework tests pass (2332 tests, 8 pre-existing net module flaky failures unrelated to this PR)
  • Added 4 new test classes, 12 new test methods:
    • MerklePathEncodeTest: size mismatch → ZksnarkException; empty lists → no exception
    • DefensiveChecksTest: energy weight zero throws/returns-zero based on allowNewReward(); time-backwards throws; RepositoryImpl zero weight throws
    • TransactionTraceTest (2 new methods): resetAccountUsage V1/V2 handle newSize == 0
    • ResourceCloseTest: exception-safe close chain verification with Mockito
    • ChainbaseConcurrentTest: concurrent read/write safety
  • Checkstyle passes

Barbatos added 3 commits April 2, 2026 04:44
Replace 5 assert statements with unconditional runtime checks on
consensus-critical paths. assert is disabled by default in production
JVM (no -ea flag), silently skipping invariant checks. With -ea enabled,
AssertionError bypasses catch(Exception) handlers, crashing the node.

- RepositoryImpl.increase(): time-backwards check
- RepositoryImpl.calculateGlobalEnergyLimit(): zero energy weight check
- EnergyProcessor.calculateGlobalEnergyLimit(): same, symmetric with
  allowNewReward() early return
- ResourceProcessor.increase(): time-backwards check
- MerklePath.encode(): authentication path / index size mismatch
- TransactionTrace: add zero-check guard before newArea/newSize division
  in resetAccountUsage() and resetAccountUsageV2() to prevent
  ArithmeticException when window size is zero
- ResilienceService: replace all new Random() with
  ThreadLocalRandom.current() to avoid unnecessary object allocation;
  remove Random field from WeightedRandom inner class
- DbConvert: merge parallel List<byte[]> (keys/values) into single
  List<Map.Entry<byte[], byte[]>> to eliminate index-mismatch risk
- Chainbase.head: declare as volatile for cross-thread visibility
- Chainbase.prefixQuery(): snapshot head into local variable at entry
  and pass to prefixQueryRoot()/prefixQuerySnapshot() to guarantee
  consistency within a single call
- TronDatabase.close(): split writeOptions.close() and
  dbSource.closeDB() into separate try-catch blocks so one failure
  does not skip the other
- CheckPointV2Store.close(): close own writeOptions in try-catch then
  call super.close() for full parent cleanup chain
- LevelDB.close() / RocksDB.close(): same split try-catch pattern
- SnapshotManager.checkV2(): wrap getCheckpointDB() in
  try-with-resources so close() runs even if recover() throws
@barbatos2011 barbatos2011 force-pushed the fix/defensive-code-safety branch from c378136 to 4e171df on April 1, 2026 20:44
Comment on lines +376 to 380
Snapshot snapshot = localHead;
if (!snapshot.equals(localHead.getRoot())) {
Map<WrappedByteArray, WrappedByteArray> all = new HashMap<>();
((SnapshotImpl) snapshot).collect(all, key);
all.forEach((k, v) -> result.put(k, v.getBytes()));
Collaborator


Although it appears to bind the query to the caller-selected view via Snapshot snapshot = localHead;, the subsequent call to ((SnapshotImpl) snapshot).collect(all, key) does not actually collect from localHead. Since SnapshotImpl.collect(...) starts from getRoot().getNext() and walks forward to the latest tip, the effective scan range is the entire overlay chain after root, not the caller's resolved snapshot.

Author


Good catch. You're right that collect() always walks from getRoot().getNext() to the end of the chain, regardless of which SnapshotImpl instance it's called on. So even with the local snapshot, if head advances during prefixQuery, collect() will scan into the newly appended snapshots.

This can be fixed by adding a boundary parameter to collect():

// SnapshotImpl.java
synchronized void collect(Map<WrappedByteArray, WrappedByteArray> all,
                          byte[] prefix, Snapshot boundary) {
    Snapshot next = getRoot().getNext();
    while (next != null) {
        Streams.stream(((SnapshotImpl) next).db)
            .filter(e -> Bytes.indexOf(
                Objects.requireNonNull(e.getKey().getBytes()), prefix) == 0)
            .forEach(e -> all.put(WrappedByteArray.of(e.getKey().getBytes()),
                WrappedByteArray.of(e.getValue().getBytes())));
        if (next == boundary) {
            break;
        }
        next = next.getNext();
    }
}

Then prefixQuerySnapshot passes localHead as the boundary:

((SnapshotImpl) localHead).collect(all, key, localHead);

That said, two notes:

  1. This is a pre-existing issue, not introduced by this PR. The original code read head directly and collect() walked to the end in exactly the same way. The local snapshot fix in this PR still provides value — it ensures prefixQueryRoot and prefixQuerySnapshot use the same getRoot(), which was the original read-read inconsistency.

  2. Other callers have the same issue. Chainbase.getAll() (line 179) and getNext() (line 325) also call collect() without a boundary. If this is worth fixing, all call sites should be addressed together.

Recommend tracking this as a separate follow-up issue rather than expanding the scope of this PR.

public Map<WrappedByteArray, byte[]> prefixQuery(byte[] key) {
Map<WrappedByteArray, byte[]> result = prefixQueryRoot(key);
Map<WrappedByteArray, byte[]> snapshot = prefixQuerySnapshot(key);
Snapshot localHead = head;
Collaborator


prefixQuery() previously depended on cursor-resolved snapshot selection through head(), which supports HEAD, SOLIDITY, and PBFT. The current implementation now bypasses that logic by reading the raw head field directly. As a result, prefixQuery() can diverge from get()/getUnchecked()/has() under non-HEAD cursors.

Author


Valid issue. The original code used head() in prefixQuerySnapshot which routes through cursor logic (HEAD/SOLIDITY/PBFT), but this PR replaced it with the raw head field, bypassing cursor resolution entirely. Under SOLIDITY or PBFT cursors, prefixQuery would now return unconfirmed head-level data that get()/getUnchecked()/has() would not see.

Fix: change Snapshot localHead = head to Snapshot localHead = head(). This preserves both the cursor semantics and the single-read consistency that the local snapshot was intended to provide.

}

@Test
public void testConcurrentSetHeadAndPrefixQuery() throws Exception {
Collaborator


It needs content assertions, not just liveness assertions, to validate the advertised fix for prefixQuery() consistency.

Author


Agreed. The current test only verifies liveness (no exceptions, non-null results), which is insufficient to validate snapshot consistency. Will add content assertions — specifically, verifying that under SOLIDITY cursor, prefixQuery does not return head-only keys.

Note: there's also a pre-existing issue where SnapshotImpl.collect() always walks from getRoot().getNext() to the latest tip regardless of the caller's snapshot boundary. This means even with the cursor fix, prefixQuery overlay results may include data beyond the resolved snapshot. Will document this as a known limitation and track separately.


latch.countDown();
executor.shutdown();
executor.awaitTermination(30, TimeUnit.SECONDS);
Collaborator


Add a check on executor.awaitTermination(...)'s boolean result. Right now, if it returns false, the 30-second timeout expired and worker threads may still be running, yet the test continues into chainbase.reset() and chainbase.close() in the finally block.

Author


Good point. If awaitTermination returns false, worker threads may still be running when the finally block executes reset()/close(), causing a race. Will add shutdownNow() on timeout:

executor.shutdown();
if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
    executor.shutdownNow();
}

The @Rule Timeout.seconds(60) provides a backstop against hangs, but this ensures clean thread termination before resource cleanup.


if (lastTime != now) {
assert now > lastTime;
if (now < lastTime) {
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still needs to be confirmed that this branch could not have been taken historically.

@kuny0707 kuny0707 closed this Apr 2, 2026
@kuny0707 kuny0707 reopened this Apr 2, 2026
@kuny0707 kuny0707 closed this Apr 2, 2026