
PHOENIX-7751 : [SyncTable Tool] Feature to validate table data using PhoenixSyncTable tool b/w source and target cluster#2379

Open
rahulLiving wants to merge 27 commits into apache:master from rahulLiving:PHOENIX-7751

Conversation

@rahulLiving (Contributor)

No description provided.

@rahulLiving rahulLiving marked this pull request as ready for review March 12, 2026 12:36

/**
* PhoenixSyncTableTool chunk metadata cell qualifiers. These define the wire protocol between
* hoenixSyncTableRegionScanner (server-side coprocessor) and PhoenixSyncTableMapper (client-side
Contributor

Typo missing 'P'


public static Long getPhoenixSyncTableFromTime(Configuration conf) {
Preconditions.checkNotNull(conf);
String value = conf.get(PHOENIX_SYNC_TABLE_FROM_TIME);
@tkhurana (Contributor), Mar 18, 2026

Why didn't you use conf.getLong() ?
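One plausible reason for the `conf.get` pattern (sketched below with `java.util.Properties` standing in for Hadoop's `Configuration`, and hypothetical helper names) is that `getLong` requires a default value, while `conf.get` lets the getter return `null` for an unset key:

```java
import java.util.Properties;

// Sketch only, not the PR's code: contrasting a nullable-Long getter with a
// getLong-style accessor. Properties stands in for Hadoop's Configuration.
public class ConfLongDemo {
    static final String PHOENIX_SYNC_TABLE_FROM_TIME = "phoenix.synctable.from.time";

    // Pattern in the PR: returns null when the key is absent.
    static Long getFromTimeNullable(Properties conf) {
        String value = conf.getProperty(PHOENIX_SYNC_TABLE_FROM_TIME);
        return value == null ? null : Long.parseLong(value);
    }

    // conf.getLong-style: a default is mandatory, so "unset" needs a sentinel.
    static long getFromTimeWithDefault(Properties conf, long defaultValue) {
        String value = conf.getProperty(PHOENIX_SYNC_TABLE_FROM_TIME);
        return value == null ? defaultValue : Long.parseLong(value);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        if (getFromTimeNullable(conf) != null) throw new AssertionError();
        if (getFromTimeWithDefault(conf, -1L) != -1L) throw new AssertionError();
        conf.setProperty(PHOENIX_SYNC_TABLE_FROM_TIME, "1234");
        if (getFromTimeNullable(conf) != 1234L) throw new AssertionError();
        System.out.println("ok");
    }
}
```

If "unset" is representable by a sentinel such as `-1`, `conf.getLong` is the simpler choice, as the reviewer suggests.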

conf.setLong(PHOENIX_SYNC_TABLE_TO_TIME, toTime);
}

public static Long getPhoenixSyncTableToTime(Configuration conf) {
@tkhurana (Contributor), Mar 18, 2026

Here also why didn't you use conf.getLong ?

return configuration.getBoolean(MAPREDUCE_RANDOMIZE_MAPPER_EXECUTION_ORDER,
DEFAULT_MAPREDUCE_RANDOMIZE_MAPPER_EXECUTION_ORDER);
}

Contributor

IMO these APIs can remain in PhoenixSyncTableTool class only. They are specific to Sync tool

Contributor Author

I saw other tools also have their setters/getters in PhoenixConfigurationUtil.java, so I followed the same pattern. I am okay to move them.

Contributor

By definition util is something which is useful in multiple contexts. I don't think we should follow the same pattern.

return false;
}

buildChunkMetadataResult(results, isTargetScan);
Contributor

If we break out early due to page timeout won't we have a partial chunk ?

Contributor

It seems that isTargetScan is for a different purpose, or at least the naming can be improved.

Contributor Author

If we break out early due to page timeout won't we have a partial chunk ?

I have kept the source from having a partial chunk: whatever can be processed within the page timeout is considered the source chunk, and the target then scans with that source chunk size.
We could allow a partial chunk for the source, but I was thinking that if chunking is taking ~5-10 mins, it is better not to hit the same server immediately, to let the server cool off?

For the target chunk, we always assume the target is a partial chunk, and calculate the final checksum in the Mapper itself once the full row boundary has been read.
That is why isTargetScan is synonymous with partialChunk.

Contributor

I am not sure what you mean by "not to hit the same server immediately to let server cool off". I would suggest refactoring the code a little bit. isTargetScan is already a field of this class; no need to pass it as a parameter to the buildChunkMetadataResult function. Rather, just reference the field directly inside the function. IMO the current naming is a little confusing. Also, please add comments on why the target chunk is always assumed to be a partial chunk.

Contributor Author

I am not sure what you mean by "not to hit the same server immediately to let server cool off"

I meant: if we see a page timeout (15 mins) on the source cluster, we return with whatever has been collected, be it 1 row or 1 GB of rows, and then look for the same rows within the target region boundary.
We have the option to keep building the source chunk after a page timeout, until we reach 1 GB of data or the end of the region, but I avoided that. If we are seeing page timeouts, it indicates the server is not able to keep up with the requests, so instead we go to the target cluster to get the chunk and validate the checksum.
Let's say it took a minute to calculate the checksum from the target cluster; we have then delayed hitting the source RS again by one minute.

Contributor Author

Lets say it took a min to calculate checksum from target cluster, we delayed hitting source RS again by 1 min.

I realized: shall we also add an explicit delay on the Mapper side to throttle, if the source chunk times out before processing a 1 GB chunk?

Comment on lines +81 to +84
private byte[] chunkStartKey = null;
private byte[] chunkEndKey = null;
private long currentChunkSize = 0L;
private long currentChunkRowCount = 0L;
Contributor

Improvement can be made here to introduce the notion of a chunk object

byte[] rowKey = CellUtil.cloneRow(rowCells.get(0));
long rowSize = calculateRowSize(rowCells);
addRowToChunk(rowKey, rowCells, rowSize);
if (!isTargetScan && willExceedChunkLimits(rowSize)) {
Contributor

So addRowToChunk is already adding the rowSize to chunkSize and then willExceedChunkLimits is again adding rowSize to chunkSize
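The double-count the reviewer describes can be sketched in miniature (the method names mirror the snippet above, but the bodies are hypothetical reconstructions, not the PR's actual code):

```java
// Sketch of the accounting bug: rowSize is added to the running chunk size
// in addRowToChunk, then added again inside willExceedChunkLimits, so the
// limit check fires one row too early.
public class ChunkAccountingDemo {
    static final long CHUNK_SIZE_LIMIT = 100L;
    long currentChunkSize = 0L;

    void addRowToChunk(long rowSize) {
        currentChunkSize += rowSize;   // rowSize counted here...
    }

    boolean willExceedChunkLimits(long rowSize) {
        // ...and counted again here.
        return currentChunkSize + rowSize >= CHUNK_SIZE_LIMIT;
    }

    // Once addRowToChunk has run, checking the accumulated size alone suffices.
    boolean willExceedChunkLimitsFixed() {
        return currentChunkSize >= CHUNK_SIZE_LIMIT;
    }

    public static void main(String[] args) {
        ChunkAccountingDemo demo = new ChunkAccountingDemo();
        demo.addRowToChunk(60L);                                           // chunk holds 60 bytes
        if (!demo.willExceedChunkLimits(60L)) throw new AssertionError();  // 60+60 >= 100: fires early
        if (demo.willExceedChunkLimitsFixed()) throw new AssertionError(); // 60 < 100: correct
        System.out.println("ok");
    }
}
```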

public boolean next(List<Cell> results, ScannerContext scannerContext) throws IOException {
region.startRegionOperation();
try {
resetChunkState();
@tkhurana (Contributor), Mar 20, 2026

If you have a notion of a chunk object then you don't need reset you can simply create a new chunk
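The suggested refactor could look roughly like this (a hypothetical sketch; the field names follow the snippet at lines +81 to +84, everything else is an assumption):

```java
// Sketch of a Chunk value object folding chunkStartKey, chunkEndKey,
// currentChunkSize and currentChunkRowCount into one type, so
// resetChunkState() becomes simply "new Chunk()".
public class Chunk {
    private byte[] startKey;
    private byte[] endKey;
    private long sizeBytes;
    private long rowCount;

    void addRow(byte[] rowKey, long rowSize) {
        if (startKey == null) {
            startKey = rowKey;    // first row of the chunk
        }
        endKey = rowKey;          // last row seen so far
        sizeBytes += rowSize;
        rowCount++;
    }

    long getSizeBytes() { return sizeBytes; }
    long getRowCount() { return rowCount; }
    byte[] getStartKey() { return startKey; }
    byte[] getEndKey() { return endKey; }

    public static void main(String[] args) {
        Chunk chunk = new Chunk();
        chunk.addRow(new byte[] { 1 }, 10L);
        chunk.addRow(new byte[] { 2 }, 20L);
        if (chunk.getRowCount() != 2 || chunk.getSizeBytes() != 30L) throw new AssertionError();
        chunk = new Chunk();      // "reset" is just a fresh instance
        if (chunk.getRowCount() != 0) throw new AssertionError();
        System.out.println("ok");
    }
}
```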

/**
* PhoenixSyncTableTool scan attributes for server-side chunk formation and checksum
*/
public static final String SYNC_TABLE_CHUNK_FORMATION = "_SyncTableChunkFormation";
Contributor

Should all of these instead be named SYNC_TOOL ?

Contributor Author

I have used SyncTableTool for the user-facing class/config; for the others I have used SyncTable. Are you recommending moving all classes and configs to SyncTool instead of SyncTable, i.e. PhoenixSyncTableRegionScanner -> PhoenixSyncToolRegionScanner?
I felt SyncTable is more self-explanatory than SyncTool; we could also change it to SyncTableTool in all places?

Contributor

I see. It's okay. Not a big deal. We can stick with the same naming convention.

Comment on lines +168 to +172
if (chunkStartKey == null) {
LOGGER.warn("Paging timed out while fetching first row of chunk, initStartRowKey: {}",
Bytes.toStringBinary(initStartRowKey));
updateDummyWithPrevRowKey(results, initStartRowKey, includeInitStartRowKey, scan);
return true;
Contributor

Is this ever hit ? Even with 0 page timeout we get at least one row

Contributor Author

Yeah, I was not able to repro it in my integration test. Kept it as a defensive check.

Even with 0 page timeout we get at least one row

What would this row contain, if we couldn't get any row from the table?

Contributor

Either we will get some exception or we will get a row. You can simplify your code by not handling this case. I think this will also get rid of the updateDummyWithPrevRowKey function, as this seems to be the only place calling it.

Contributor Author

I see, then we can remove the handling of dummy rows in SyncTableMapper as well.

@Override
protected void map(NullWritable key, DBInputFormat.NullDBWritable value, Context context)
throws IOException, InterruptedException {
context.getCounter(PhoenixJobCounters.INPUT_RECORDS).increment(1);
Contributor

What is the meaning of INPUT_RECORDS in the context of sync tool ?

@rahulLiving (Contributor Author), Mar 22, 2026

It indicates number of mappers created

Contributor

Why do you need a Phoenix-specific counter for it? The MapReduce framework already tells us the number of mappers.

Contributor Author

Via this we are also maintaining counters for mappers created, mappers with no failed chunks, and mappers with failed chunks.
Based on your suggestion on the other comment, we can rename them to VERIFIED_MAPPER, FAILED_MAPPER.


if (sourceRowsProcessed > 0) {
if (mismatchedChunk == 0) {
context.getCounter(PhoenixJobCounters.OUTPUT_RECORDS).increment(1);
Contributor

What does the OUTPUT_RECORDS mean in the context of Sync tool ?

Contributor Author

Number of mappers successfully processed. We also have FAILED_RECORD for failed mappers.

@tkhurana (Contributor), Mar 23, 2026

RECORDS generally implies rows. One suggestion could be to use VERIFIED_CHUNKS, FAILED_CHUNKS

+ " TO_TIME BIGINT NOT NULL,\n" + " START_ROW_KEY VARBINARY_ENCODED,\n"
+ " END_ROW_KEY VARBINARY_ENCODED,\n" + " IS_DRY_RUN BOOLEAN, \n"
+ " EXECUTION_START_TIME TIMESTAMP,\n" + " EXECUTION_END_TIME TIMESTAMP,\n"
+ " STATUS VARCHAR(20),\n" + " COUNTERS VARCHAR(255), \n"
Contributor

I don't think Counters should have a fixed limit. Just make them VARCHAR so that we can add more counters in the future.

Contributor Author

I think we should also add tenantId as one of the PK columns.


public enum Type {
CHUNK,
MAPPER_REGION
Contributor

maybe just REGION


String query = "SELECT START_ROW_KEY, END_ROW_KEY FROM " + SYNC_TABLE_CHECKPOINT_TABLE_NAME
+ " WHERE TABLE_NAME = ? AND TARGET_CLUSTER = ?"
+ " AND TYPE = ? AND FROM_TIME = ? AND TO_TIME = ? AND STATUS IN ( ?, ?)";
Contributor

There are only 2 possible statuses, so does it make sense to set them in the query? If you don't, then you are only querying PK columns without any filter.

qSchemaName = SchemaUtil.normalizeIdentifier(schemaName);
PhoenixMapReduceUtil.validateTimeRange(startTime, endTime, qTable);
PhoenixMapReduceUtil.validateMaxLookbackAge(configuration, endTime, qTable);
if (LOGGER.isDebugEnabled()) {
Contributor

Let's move this log to INFO level. It will be useful.

formatter.printHelp("hadoop jar phoenix-server.jar " + PhoenixSyncTableTool.class.getName(),
"Synchronize a Phoenix table between source and target clusters", options,
"\nExample usage:\n"
+ "hadoop jar phoenix-server.jar org.apache.phoenix.mapreduce.PhoenixSyncTableTool \\\n"
Contributor

Generally we run IndexTool via /hbase/bin/hbase IndexTool.

qTable = SchemaUtil.getQualifiedTableName(schemaName, tableName);
qSchemaName = SchemaUtil.normalizeIdentifier(schemaName);
PhoenixMapReduceUtil.validateTimeRange(startTime, endTime, qTable);
PhoenixMapReduceUtil.validateMaxLookbackAge(configuration, endTime, qTable);
Contributor

Do we need the end time to be within the max lookback window ? How will the sync tool break if the end time is outside of the max lookback window ?

@rahulLiving (Contributor Author), Mar 24, 2026

Right, this check is not useful.

Contributor

On the other hand, we should not only enforce that it is not outside the window; we should also enforce a "safety buffer" to accommodate the data in flight. Even when the endTime is within the window, if it is too close to the current time it may miss data that is still in flight and cause false positives. In practice this may not matter, as the time it takes to set up and run could be on the order of several minutes, enough for the catch-up to complete, but I think it is better to make this explicit by enforcing a safety buffer and making it more deterministic.

If we remove this check and allow the endTime to be in the future, the possibility of false positives due to data in flight becomes a lot more pronounced. By enforcing both startTime and endTime, we can ensure a "consistent window" where data is guaranteed to be fully replicated and 'quiesced' on both sides. WDYT?

Contributor

I was thinking more about the "consistent window" or "quiesced window" approach that I suggested above and realized this is actually a race against sliding window during long-running jobs.

If a sync job takes several hours to complete, a startTime that was valid at the beginning of the job might actually 'slide' out of the lookback window by the time the final Mappers execute. Since HBase compactions on the Source and Target clusters aren't synchronized, couldn't this lead to false-positive mismatches if one cluster purges historical data mid-run while the other hasn't yet?

It may not always be possible to make the "safety buffer" on the startTime large enough to account for the job execution time: what if the max lookback window is only a few hours and the job itself takes hours? Does this require utilizing HBase snapshots to 'freeze' the data state for the duration of the sync? Are there existing patterns that other systems might have employed to handle this issue?

@kadirozde @tkhurana

Contributor Author

We need to think about this from two perspectives: running the sync job regularly as a cron, and using it for migration validation.
For migration validation, the start time would definitely be before maxLookbackAge. It is up to the owner whether they want to validate all versions and delete markers or just the latest version.
For a regular cron job to be used in PhoenixHA, we can configure the start/end time to be within maxLookbackAge.
Tanuj's suggestion of giving the user the flexibility to choose the rawScan & allVersions options would be helpful. And since we plan to fix the mismatched rows as well, we can consider the source as SOT and fix accordingly.
Though there can be instances where it can't be fixed, e.g. the source has removed a delete marker via compaction but the target still has the delete marker. Such rows can be flagged as not fixable, as per design.

Btw, the default endTime is (currentTime - 1 hour), to ensure the target has the desired data.

PhoenixConfigurationUtil.setPhoenixSyncTableChunkSizeBytes(configuration, chunkSizeBytes);
}
if (tenantId != null) {
PhoenixConfigurationUtil.setTenantId(configuration, tenantId);
Contributor

Can you verify if the tenantid is being correctly set as a key prefix on the scan ?

Contributor

If you have a table region with multiple tenants and we pass a tenant id then our scan range should start with the tenantid prefix.

@rahulLiving (Contributor Author), Mar 24, 2026

Yes, it only creates input ranges and scans for tenant-specific rows. We have an IT for the same.

* Configures a Configuration object with ZooKeeper settings from a ZK quorum string.
* @param baseConf Base configuration to create from (typically job configuration)
* @param zkQuorum ZooKeeper quorum string in format: "zk_quorum:port:znode" Example:
* "zk1,zk2,zk3:2181:/hbase"
Contributor

This is actually not the only format for a zk quorum; there are other valid formats where the port number is specified separately for each server. There is actually a very useful API in HBase, HBaseConfiguration.createClusterConf(job.getConfiguration(), targetZkQuorum). We should use that, as it also works for the zk registry.


String query = "SELECT START_ROW_KEY, END_ROW_KEY FROM " + SYNC_TABLE_CHECKPOINT_TABLE_NAME
+ " WHERE TABLE_NAME = ? AND TARGET_CLUSTER = ?"
+ " AND TYPE = ? AND FROM_TIME = ? AND TO_TIME = ? AND STATUS IN ( ?, ?)";
Contributor

I am not 100% positive that you can assume that the output of this query is always sorted by row key. You might have to add an ORDER BY clause here. If you are adding an ORDER BY clause it will be better to add all the PK columns to make the sorting more efficient.

int completedIdx = 0;

// Two pointer comparison across splitRange and completedRange
while (splitIdx < allSplits.size() && completedIdx < completedRegions.size()) {
Contributor

I think you are assuming here that completedRegions is already sorted. Please see my comment on the getProcessedMapperRegions function.

Contributor

Won't the results be sorted in the PK order already? I see that the new commit adds ORDER BY, but not sure why that is required.

PhoenixInputSplit split = (PhoenixInputSplit) allSplits.get(splitIdx);
KeyRange splitRange = split.getKeyRange();
KeyRange completedRange = completedRegions.get(completedIdx);
byte[] splitStart = splitRange.getLowerRange();
Contributor

Will the end key of the split range always be exclusive? If yes, can you please add a comment.

Contributor Author

Yes, for both splitRange and completedRange, the start key is always inclusive and the end key always exclusive. Will add a comment.
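With both ranges half-open [start, end), the two-pointer walk above reduces to an overlap test on byte ranges. A stdlib-only sketch of that test under the semantics confirmed here (a hypothetical helper, not the PR's code; it follows the HBase convention that an empty end key means "unbounded" and uses `Arrays.compareUnsigned`, Java 9+, for row-key order):

```java
import java.util.Arrays;

// Overlap test for half-open [start, end) byte-array key ranges.
public class KeyRangeOverlapDemo {
    static boolean overlaps(byte[] aStart, byte[] aEnd, byte[] bStart, byte[] bEnd) {
        // a lies entirely before b when a's (bounded) end <= b's start.
        boolean aBeforeB = aEnd.length > 0 && Arrays.compareUnsigned(aEnd, bStart) <= 0;
        boolean bBeforeA = bEnd.length > 0 && Arrays.compareUnsigned(bEnd, aStart) <= 0;
        return !aBeforeB && !bBeforeA;
    }

    public static void main(String[] args) {
        byte[] a = { 0x01 }, b = { 0x05 }, c = { 0x09 };
        byte[] open = {};
        // [a,b) and [b,c) share only the excluded boundary b: no overlap.
        if (overlaps(a, b, b, c)) throw new AssertionError();
        // [a,c) and [b,c) overlap on [b,c).
        if (!overlaps(a, c, b, c)) throw new AssertionError();
        // [b, unbounded) overlaps [a,c).
        if (!overlaps(b, open, a, c)) throw new AssertionError();
        System.out.println("ok");
    }
}
```

The exclusive end key matters: if the end key were inclusive, adjacent regions would falsely register as overlapping.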

* @return List of (startKey, endKey) pairs representing unprocessed ranges
*/
@VisibleForTesting
public List<Pair<byte[], byte[]>> calculateUnprocessedRanges(byte[] mapperRegionStart,
@tkhurana (Contributor), Mar 24, 2026

Maybe we could return a List<KeyRange>

if (hasStartBoundary) {
queryBuilder.append(" AND END_ROW_KEY >= ?");
}
queryBuilder.append(" AND STATUS IN (?, ?)");
Contributor

Same as above we don't need to pass status

scan.setCacheBlocks(false);
scan.setTimeRange(fromTime, toTime);
if (isTargetScan) {
scan.setLimit(1);
Contributor

Can you add a comment here on why we are setting the limit to 1 and caching to 1?

Scan scan = new Scan();
scan.withStartRow(startKey, isStartKeyInclusive);
scan.withStopRow(endKey, isEndKeyInclusive);
scan.setRaw(true);
Contributor

Are we sure we have to do raw scan ?

Contributor

Also, can we make this configurable via the SyncTool command line?

scan.withStartRow(startKey, isStartKeyInclusive);
scan.withStopRow(endKey, isEndKeyInclusive);
scan.setRaw(true);
scan.readAllVersions();
Contributor

Same can we make the behavior of reading all versions configurable.

@@ -0,0 +1,2267 @@
/*
Contributor

Can you add a test where rows are deleted on both the source and target tables but compaction has run on only one? We can actually have 2 cases: compaction run on the source but not on the target, and vice versa. I saw that you are doing a raw scan. Maxlookback settings will also impact this.

@haridsv (Contributor) left a comment

I just skimmed through and left some comments at the surface level.

Comment on lines +46 to +52
public static byte[] encodeDigestState(SHA256Digest digest) {
byte[] encoded = digest.getEncodedState();
ByteBuffer buffer = ByteBuffer.allocate(Bytes.SIZEOF_INT + encoded.length);
buffer.putInt(encoded.length);
buffer.put(encoded);
return buffer.array();
}
Contributor

Since MAX_SHA256_DIGEST_STATE_SIZE is capped at 128 bytes, using a 4-byte integer and a ByteBuffer for the length prefix is slightly over-engineered. We can optimize this by using a single byte for the length and Bytes.add() for concatenation. This would allow us to remove the ByteBuffer, ByteArrayInputStream, and DataInputStream dependencies in these utility methods.

Suggested change
public static byte[] encodeDigestState(SHA256Digest digest) {
byte[] encoded = digest.getEncodedState();
ByteBuffer buffer = ByteBuffer.allocate(Bytes.SIZEOF_INT + encoded.length);
buffer.putInt(encoded.length);
buffer.put(encoded);
return buffer.array();
}
public static byte[] encodeDigestState(SHA256Digest digest) {
byte[] encoded = digest.getEncodedState();
// Use an unsigned byte as 128 > Byte.MAX_VALUE
return Bytes.add(new byte[]{(byte) (encoded.length & 0xff)}, encoded);
}

Contributor

BTW, can you tell me why we need to encode the length into it? You are using PhoenixKeyValueUtil.newKeyValue, which already encodes the length of the byte[] anyway.

public class SHA256DigestUtil {

/**
* Maximum allowed size for encoded SHA-256 digest state. SHA-256 state is ~96 bytes, we allow up
Contributor

Can you point me to the documentation on the size being ~96 bytes?


Comment on lines +65 to +75
DataInputStream dis = new DataInputStream(new ByteArrayInputStream(encodedState));
int stateLength = dis.readInt();
// Prevent malicious large allocations
if (stateLength > MAX_SHA256_DIGEST_STATE_SIZE) {
throw new IllegalArgumentException(
String.format("Invalid SHA256 state length: %d, expected <= %d", stateLength,
MAX_SHA256_DIGEST_STATE_SIZE));
}

byte[] state = new byte[stateLength];
dis.readFully(state);
Contributor

Following my suggestion in encode, this will simply become:

Suggested change
DataInputStream dis = new DataInputStream(new ByteArrayInputStream(encodedState));
int stateLength = dis.readInt();
// Prevent malicious large allocations
if (stateLength > MAX_SHA256_DIGEST_STATE_SIZE) {
throw new IllegalArgumentException(
String.format("Invalid SHA256 state length: %d, expected <= %d", stateLength,
MAX_SHA256_DIGEST_STATE_SIZE));
}
byte[] state = new byte[stateLength];
dis.readFully(state);
int stateLength = encodedState[0] & 0xff;
// Prevent malicious large allocations
if (stateLength > MAX_SHA256_DIGEST_STATE_SIZE) {
throw new IllegalArgumentException(
String.format("Invalid SHA256 state length: %d, expected <= %d", stateLength,
MAX_SHA256_DIGEST_STATE_SIZE));
}
byte[] state = new byte[stateLength];
System.arraycopy(encodedState, 1, state, 0, stateLength);
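The suggested 1-byte prefix can be exercised end to end with plain byte arrays standing in for `SHA256Digest.getEncodedState()` (a BouncyCastle type, so not used here; method names are hypothetical):

```java
// Round-trip sketch of the single-byte length prefix proposed above.
public class DigestStatePrefixDemo {
    static final int MAX_STATE_SIZE = 128;

    static byte[] encode(byte[] state) {
        // Unsigned-byte length prefix: valid because 128 fits in 0..255.
        byte[] out = new byte[1 + state.length];
        out[0] = (byte) (state.length & 0xff);
        System.arraycopy(state, 0, out, 1, state.length);
        return out;
    }

    static byte[] decode(byte[] encoded) {
        int stateLength = encoded[0] & 0xff;    // read prefix as unsigned
        if (stateLength > MAX_STATE_SIZE) {
            throw new IllegalArgumentException("Invalid state length: " + stateLength);
        }
        byte[] state = new byte[stateLength];
        System.arraycopy(encoded, 1, state, 0, stateLength);
        return state;
    }

    public static void main(String[] args) {
        byte[] state = new byte[96];            // SHA-256 state is ~96 bytes
        for (int i = 0; i < state.length; i++) state[i] = (byte) i;
        byte[] decoded = decode(encode(state));
        if (!java.util.Arrays.equals(state, decoded)) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Note the `& 0xff` on decode: a 128-byte state encodes as `(byte) -128`, so the prefix must be widened as unsigned before the bounds check.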

*/
public static final String SYNC_TABLE_CHUNK_FORMATION = "_SyncTableChunkFormation";
public static final String SYNC_TABLE_CHUNK_SIZE_BYTES = "_SyncTableChunkSizeBytes";
public static final String SYNC_TABLE_CONTINUED_DIGEST_STATE = "_SyncTableContinuedDigestState";
Contributor

Can you add JavaDoc on all 3 constants individually, with a description of what the attribute is and what type of value it would contain?

return scan.getAttribute((BaseScannerRegionObserverConstants.REBUILD_INDEXES)) != null;
}

public static boolean isSyncTableChunkFormation(Scan scan) {
Contributor

Do you mean isSyncTableChunkFormationEnabled?

connectToTargetCluster();
globalConnection = createGlobalConnection(conf);
syncTableOutputRepository = new PhoenixSyncTableOutputRepository(globalConnection);
} catch (SQLException | IOException e) {
Contributor

I would include RuntimeException too, to be more aggressive in avoiding a resource leak.

Bytes.toBytes(chunkSizeBytes));
}
long syncTablePageTimeoutMs = (long) (conf.getLong(HConstants.HBASE_RPC_TIMEOUT_KEY,
QueryServicesOptions.DEFAULT_SYNC_TABLE_RPC_TIMEOUT) * 0.5);
Contributor

What is the basis for this multiplier? Should it be a configurable value?

queryBuilder.append(" AND START_ROW_KEY <= ?");
}
if (hasStartBoundary) {
queryBuilder.append(" AND END_ROW_KEY >= ?");
Contributor

For the last region, there will be no constraint on START_ROW_KEY, and END_ROW_KEY is not part of the PK, so this can perform poorly, I think.

Contributor

Perhaps we should add " AND START_ROW_KEY > <0x00>"? You may want to check the query plan for with and without this constraint to see if it is helping.

private Boolean isDryRun;
private byte[] startRowKey;
private byte[] endRowKey;
private Boolean isFirstRegion;
Contributor

Is this being used?

return parseCounterValue(PhoenixSyncTableMapper.SyncCounters.TARGET_ROWS_PROCESSED.name());
}

@VisibleForTesting
Contributor

On a private function?

qTable = SchemaUtil.getQualifiedTableName(schemaName, tableName);
qSchemaName = SchemaUtil.normalizeIdentifier(schemaName);
PhoenixMapReduceUtil.validateTimeRange(startTime, endTime, qTable);
PhoenixMapReduceUtil.validateMaxLookbackAge(configuration, endTime, qTable);
Contributor

Don't we need to validate that the startTime is within the max lookback window? If startTime is beyond the window, background compactions on the source and target clusters may have purged different sets of historical versions or delete markers. Since these compactions don't run in sync, the tool will see different data states on each cluster, leading to false-positive mismatches for data that is actually consistent but has simply been cleaned up at different times.

Contributor
@tkhurana tkhurana Mar 27, 2026

The question is: do we want to verify the full history or just the latest version? If we don't do raw scans and don't fetch all the versions, then we don't have to deal with all the complexity around the max lookback window and compactions. It will also make the tool run faster. That is why I suggested making those configurable. I feel we don't really need to look at the entire history. As long as the end time is (current time - some configurable lag), we can get a consistent snapshot of the source and target.

Contributor

I just saw your other comments on setRaw(). I agree, if we don't do a raw scan, then we don't have to worry about these aspects, but if the HA guarantee includes the history (e.g., CDC), then do we have a choice?

qTable = SchemaUtil.getQualifiedTableName(schemaName, tableName);
qSchemaName = SchemaUtil.normalizeIdentifier(schemaName);
PhoenixMapReduceUtil.validateTimeRange(startTime, endTime, qTable);
PhoenixMapReduceUtil.validateMaxLookbackAge(configuration, endTime, qTable);
Contributor

On the other hand, we should not only enforce that it is not outside the window, we should also enforce a "safety buffer" to accommodate the data in flight. Even when the endTime is within the window, if it is too close to the current time, it may miss data that is still in flight and may cause false positives. In practice, this may not matter, as the time it takes to set up and run could be on the order of several minutes, enough for the catch-up to complete, but I think it is better to make this explicit by enforcing a safety buffer and make it more deterministic.

If we remove this check and allow the endTime to be in the future, the possibility of having false positives due to data in flight becomes a lot more pronounced. By enforcing both startTime and endTime, we can ensure a "consistent window" where data is guaranteed to be fully replicated and 'quiesced' on both sides. WDYT?
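The safety-buffer idea above could be sketched like this (class name, method name, and the buffer default are hypothetical, not from the patch):

```java
public class TimeWindowSketch {
    // Reject an endTime closer to "now" than the safety buffer, so data
    // still in flight (replication lag) can settle before validation reads it.
    public static void validateEndTime(long endTimeMs, long nowMs, long safetyBufferMs) {
        if (endTimeMs > nowMs - safetyBufferMs) {
            throw new IllegalArgumentException(
                "endTime must be at least " + safetyBufferMs + " ms before current time");
        }
    }
}
```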

* @throws SQLException if connection fails
* @throws IllegalArgumentException if validation fails
*/
public static PTable validateTableForMRJob(Connection connection, String qualifiedTableName,
Contributor

How about?

Suggested change
public static PTable validateTableForMRJob(Connection connection, String qualifiedTableName,
public static PTable getPTableWithValidation(Connection connection, String qualifiedTableName,

try (PreparedStatement ps = connection.prepareStatement(UPSERT_CHECKPOINT_SQL)) {
ps.setString(1, row.getTableName());
ps.setString(2, row.getTargetCluster());
ps.setString(3, row.getType().name());
Contributor

I would recommend storing a byte code rather than a long string to reduce the size of the row key.
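A minimal sketch of the byte-code mapping (the enum name and its constants are illustrative assumptions, not taken from the patch):

```java
public enum SyncTypeCode {
    VALIDATE((byte) 1),
    REPAIR((byte) 2);

    private final byte code;

    SyncTypeCode(byte code) {
        this.code = code;
    }

    // Single byte stored in the row key instead of the full enum name.
    public byte getCode() {
        return code;
    }

    // Decode the row-key byte back to the enum when reading checkpoints.
    public static SyncTypeCode fromCode(byte code) {
        for (SyncTypeCode t : values()) {
            if (t.code == code) {
                return t;
            }
        }
        throw new IllegalArgumentException("Unknown sync type code: " + code);
    }
}
```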

ps.setBoolean(9, row.getIsDryRun());
ps.setTimestamp(10, row.getExecutionStartTime());
ps.setTimestamp(11, row.getExecutionEndTime());
ps.setString(12, row.getStatus() != null ? row.getStatus().name() : null);
Contributor

Similar to type, though not as important, it would be better to store a code rather than the name.

}
if (row.getFromTime() == null || row.getToTime() == null) {
throw new IllegalArgumentException("FromTime and ToTime cannot be null for checkpoint");
}
Contributor
@haridsv haridsv Mar 27, 2026

I see you are already using Preconditions in a few other places; why not use it here and in the other places as well, to make this consistent and less verbose?

Contributor

Edited the comment to add the missing Preconditions.

sourceTable = PhoenixConfigurationUtil.getIndexToolSourceTable(conf);
Assert.assertEquals(sourceTable, SourceTable.DATA_TABLE_SOURCE);
}

Contributor

Inadvertent change?

+ " COUNTERS VARCHAR, \n" + " CONSTRAINT PK PRIMARY KEY (\n" + " TABLE_NAME,\n"
+ " TARGET_CLUSTER,\n" + " TYPE ,\n" + " FROM_TIME,\n"
+ " TO_TIME,\n" + " TENANT_ID,\n" + " START_ROW_KEY )" + ") TTL="
+ OUTPUT_TABLE_TTL_SECONDS;
Contributor
@haridsv haridsv Mar 27, 2026

Why is UPSERT_CHECKPOINT_SQL a static field but this one is not? It seems inconsistent.

Contributor Author

We might update the schema with an ALTER statement in the future as requirements evolve, so I thought of keeping everything in one place.

try (Statement stmt = connection.createStatement()) {
stmt.execute(ddl);
connection.commit();
LOGGER.info("Successfully created or verified existence of {} table",
Contributor

No verification is being done; perhaps you want to say something like this?

Suggested change
LOGGER.info("Successfully created or verified existence of {} table",
LOGGER.info("Initialization of checkpoint table {} complete",

}
}
return 0;
}
Contributor

The counters string is generated outside the class, while parsing is done inside; this mismatch in abstraction is not good. I suggest you either encapsulate both in this class or move both into one util class.

}

queryBuilder.append(
" ORDER BY TABLE_NAME, TARGET_CLUSTER, TYPE, FROM_TIME, TO_TIME, TENANT_ID, START_ROW_KEY");
Contributor

This is already the PK order, correct? Why do we need an explicit ORDER BY?

}

queryBuilder.append(
" ORDER BY TABLE_NAME, TARGET_CLUSTER, TYPE, FROM_TIME, TO_TIME, TENANT_ID, START_ROW_KEY");
Contributor

Same question as above on the need for ORDER BY.

int completedIdx = 0;

// Two pointer comparison across splitRange and completedRange
while (splitIdx < allSplits.size() && completedIdx < completedRegions.size()) {
Contributor

Won't the results be sorted in the PK order already? I see that the new commit adds ORDER BY, but not sure why that is required.
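For reference, the two-pointer pass under discussion can be sketched in a self-contained form (simplified to String keys and List inputs; the real code compares byte[] split boundaries from InputSplit and checkpoint rows):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitFilterSketch {
    // Both lists are assumed sorted by start key (the PK order), so completed
    // regions can be skipped in a single linear pass over the splits.
    public static List<String> pendingSplits(List<String> allSplits, List<String> completed) {
        List<String> pending = new ArrayList<>();
        int splitIdx = 0;
        int completedIdx = 0;
        while (splitIdx < allSplits.size()) {
            if (completedIdx < completed.size()
                && allSplits.get(splitIdx).equals(completed.get(completedIdx))) {
                splitIdx++;        // region already checkpointed: skip it
                completedIdx++;
            } else if (completedIdx < completed.size()
                && completed.get(completedIdx).compareTo(allSplits.get(splitIdx)) < 0) {
                completedIdx++;    // stale checkpoint with no matching split
            } else {
                pending.add(allSplits.get(splitIdx++));  // still needs processing
            }
        }
        return pending;
    }
}
```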

String targetZkQuorum = PhoenixSyncTableTool.getPhoenixSyncTableTargetZkQuorum(conf);
Long fromTime = PhoenixSyncTableTool.getPhoenixSyncTableFromTime(conf);
Long toTime = PhoenixSyncTableTool.getPhoenixSyncTableToTime(conf);
List<InputSplit> allSplits = super.getSplits(context);
Contributor

Why not cast once here and avoid casting at multiple places later?

*/
private Connection createGlobalConnection(Configuration conf) throws SQLException {
Configuration globalConf = new Configuration(conf);
globalConf.unset(PhoenixConfigurationUtil.MAPREDUCE_TENANT_ID);
Contributor

Isn't tenant ID optional? Perhaps you can use the same connection for both if it is not present?
