Skip to content

HBASE-30049 RestoreSnapshotHelper creates StoreFileTracker with wrong…#8013

Open
bjomobo wants to merge 2 commits into
apache:masterfrom
bjomobo:HBASE-30049-fix-restore-sft
Open

HBASE-30049 RestoreSnapshotHelper creates StoreFileTracker with wrong…#8013
bjomobo wants to merge 2 commits into
apache:masterfrom
bjomobo:HBASE-30049-fix-restore-sft

Conversation

@bjomobo
Copy link
Copy Markdown

@bjomobo bjomobo commented Mar 31, 2026

Description

RestoreSnapshotHelper.restoreRegion() creates a StoreFileTracker using the raw Master conf which lacks table-level settings like hbase.store.file-tracker.impl=FILE. This causes DefaultStoreFileTracker to be used, whose doSetStoreFiles() is a no-op. The .filelist is never updated after restore, leading to FileNotFoundException when regions try to open files that were archived.

Regression introduced by HBASE-28564.

Changes

  • Merge table descriptor config via StoreUtils.createStoreConfiguration() before creating the tracker in restoreRegion()
  • Move tracker creation inside the snapshotFamilyFiles != null check to avoid NullPointerException on families being removed
  • Add withColumnFamilyDescriptor() to the "Add families not present in the table" code path

New Tests

  • TestRestoreSnapshotProcedureFileBasedSFT — end-to-end restore with FILE tracker
  • TestRestoreSnapshotHelperWithFileBasedSFT — unit-level .filelist verification
  • TestRestoreSnapshotFileTrackerTableLevel — table-level FILE tracker with compaction and multi-family restore

Jira: https://issues.apache.org/jira/browse/HBASE-30049

@bjomobo bjomobo force-pushed the HBASE-30049-fix-restore-sft branch 2 times, most recently from e3e3c55 to 6af91ac Compare April 2, 2026 18:52
… config causing no-op filelist updates

RestoreSnapshotHelper.restoreRegion() creates a StoreFileTracker using
the raw Master Configuration object, which does not contain table-level
settings like hbase.store.file-tracker.impl=FILE. This causes
DefaultStoreFileTracker to be instantiated, whose doSetStoreFiles() is
a complete no-op. The .filelist is never updated after the restore moves
HFiles to the archive and creates link files for the snapshot's HFiles.

When a region subsequently opens, the stale .filelist references HFiles
that were moved to the archive, resulting in FileNotFoundException and
the region getting stuck in OPENING state indefinitely.

This is a regression introduced by HBASE-28564, which refactored
reference file creation to go through the StoreFileTracker interface.
The cloneRegion() method in the same commit correctly merges the table
descriptor config via StoreUtils.createStoreConfiguration() before
creating the tracker, but restoreRegion() was missed.

The fix applies the same pattern: merge the table descriptor and column
family descriptor configuration into the Configuration object before
passing it to StoreFileTrackerFactory.create(). This ensures the
correct StoreFileTracker implementation is resolved based on the
table-level setting.

Both locations in restoreRegion() are fixed:
1. For existing families already on disk
2. For new families added from the snapshot
@bjomobo bjomobo force-pushed the HBASE-30049-fix-restore-sft branch from 6af91ac to 282757c Compare April 6, 2026 15:37
}

/**
* Test that reproduces the scenario where restore fails with FileNotFoundException: 1. Create
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this test and TestRestoreSnapshotFileTrackerTableLevel.java have some overlapping test scenarios, can you check once, or am I missing something here?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestRestoreSnapshotProcedureFileBasedSFT tests the end-to-end procedure path (single region, simple data change + compaction scenarios), while TestRestoreSnapshotFileTrackerTableLevel uses pre-split regions (4 regions), proper compaction wait via UTIL.waitFor(), and covers the multi-family restore path (add/remove column families). Happy to consolidate if you feel strongly, but I kept them separate since they exercise slightly different things

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

admin.restoreSnapshot(snapshotName); in TestRestoreSnapshotFileTrackerTableLevel also creates procedure know? but its a nit only

Copy link
Copy Markdown
Contributor

@gvprathyusha6 gvprathyusha6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

}

/**
* Test that reproduces the scenario where restore fails with FileNotFoundException: 1. Create
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

admin.restoreSnapshot(snapshotName); in TestRestoreSnapshotFileTrackerTableLevel also creates procedure know? but its a nit only

Comment on lines +512 to +513
// Family exists in the snapshot, create tracker with merged table descriptor config
// so that table-level settings are picked up.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Let's make clear that tableDesc here means the one recovered from the snapshot manifest. This means if current SFT config in the table is different from the snapshot one, it will be overwritten.

@wchevreuil
Copy link
Copy Markdown
Contributor

@bjomobo , looks like this needs some code style polishing:

mvninstall 1m 28s root in the patch failed.
compile 0m 36s hbase-server in the patch failed.
⚠️ javac 0m 36s hbase-server in the patch failed.
blanks 0m 0s The patch has no blanks issues.
⚠️ checkstyle 0m 47s hbase-server: The patch generated 2 new + 7 unchanged - 0 fixed = 9 total (was 7)
spotbugs 0m 36s hbase-server in the patch failed.
hadoopcheck 1m 40s The patch causes 20 errors with Hadoop v3.3.6.
hadoopcheck 3m 14s The patch causes 20 errors with Hadoop v3.4.2.
spotless 0m 32s patch has 29 errors when running spotless:check, run spotless:apply to fix.

Can you check on the spotless, checkstyles and spotbugs errors?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants