Next-gen columnar: BIGINT clustered PK at INT64_MAX missing on TiFlash read #10852

@JaySon-Huang

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Prerequisites:

  • TiDB cluster with next-gen / disaggregated storage: TiKV + tikv-worker + TiFlash compute, with columnar enabled (ENABLE_NEXT_GEN=1, ENABLE_NEXT_GEN_COLUMNAR=1).
  • TiFlash replica on the test table (AVAILABLE = 1).

SQL aligns with tests/fullstack-test2/clustered_index/query.test (int handle section, line 37).

DROP TABLE IF EXISTS test.t_1;

CREATE TABLE test.t_1 (
    a   BIGINT PRIMARY KEY CLUSTERED,
    col INT
);

INSERT INTO test.t_1 VALUES
    (-9223372036854775808, 1),
    (9223372036854775807, 2),
    (0, 3);

ALTER TABLE test.t_1 SET TIFLASH REPLICA 1;

-- Wait until information_schema.tiflash_replica.AVAILABLE = 1 for test.t_1

SET SESSION tidb_isolation_read_engines = 'tiflash';

-- query.test:37
SELECT * FROM test.t_1 WHERE a > -9223372036854775808;

-- Sanity: full table (also missing MAX on columnar)
SELECT * FROM test.t_1 ORDER BY a;
SELECT COUNT(*) FROM test.t_1;

Compare with SET SESSION tidb_isolation_read_engines = 'tikv' on the same table.

Optional (MPP):

SET tidb_enforce_mpp = 1;
EXPLAIN ANALYZE SELECT * FROM test.t_1 WHERE a > -9223372036854775808;

2. What did you expect to see? (Required)

Same results as TiKV / the integration test (query.test):

Line 37 (SELECT * FROM test.t_1 WHERE a > -9223372036854775808):

+---------------------+------+
| a                   | col  |
+---------------------+------+
|                   0 |    3 |
| 9223372036854775807 |    2 |
+---------------------+------+
2 rows in set

Full table — 3 rows: INT64_MIN / 1, 0 / 3, INT64_MAX / 2; COUNT(*) = 3.

3. What did you see instead (Required)

On TiFlash (columnar / disaggregated read path), queries succeed without SQL error but omit the row whose clustered PK is 9223372036854775807.

Line 37 (a > INT64_MIN):

| Engine  | Rows | Result                     |
|---------|------|----------------------------|
| TiKV    | 2    | 0/3, 9223372036854775807/2 |
| TiFlash | 1    | 0/3 only                   |

Full table (SELECT * ORDER BY a):

| Engine           | Rows | Handles present         |
|------------------|------|-------------------------|
| TiKV             | 3    | INT64_MIN, 0, INT64_MAX |
| TiFlash          | 2    | INT64_MIN, 0 only       |
| TiFlash COUNT(*) | 2    |                         |

Related predicates on TiFlash (same missing MAX row):

  • WHERE a >= -9223372036854775808 — 2 rows (INT64_MIN, 0), not 3.
  • WHERE a >= 9223372036854775807 — 0 rows (expected 1).

EXPLAIN ANALYZE (MPP) may show TableFullScan on TiFlash with actRows = 2, consistent with storage returning only two handles.

This is not the enum-PK columnar bug (#10851) and not the partition-table _tidb_tid schema mismatch on transaction scans.

Investigation summary (TiFlash / proxy / columnar)

1. TiKV key encoding vs proxy decode (handle in record keys)

  • TiDB/TiKV int handle keys use memcomparable encode_i64 / decode_i64 (tidb_query_datatype::codec::table::encode_row_key / decode_int_handle), aligned with TiFlash RecordKVFormat::encodeInt64.
  • Boundary roundtrip is correct, e.g. INT64_MAX → key suffix ffffffffffffffff.
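
The boundary roundtrip above can be checked with a minimal sketch of the memcomparable scheme (flip the sign bit, then write big-endian). The function names below are hypothetical stand-ins for illustration and are not the actual encode_i64 / decode_i64 implementations in the codec crate.

```rust
/// Sign bit of a 64-bit integer.
const SIGN_MASK: u64 = 0x8000_0000_0000_0000;

/// Illustrative memcomparable encoding: flipping the sign bit and writing
/// big-endian makes unsigned byte-wise order match signed numeric order.
fn encode_i64_comparable(v: i64) -> [u8; 8] {
    ((v as u64) ^ SIGN_MASK).to_be_bytes()
}

/// Inverse of `encode_i64_comparable`.
fn decode_i64_comparable(b: [u8; 8]) -> i64 {
    (u64::from_be_bytes(b) ^ SIGN_MASK) as i64
}
```

Under this scheme i64::MAX encodes to eight 0xff bytes and i64::MIN to eight 0x00 bytes, which agrees with the key suffix quoted above.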

2. Columnar pack storage (logical handle)

  • RowToColumnarReader::push_handle_from_key decodes the KV key with decode_int_handle, then stores int_handle.to_le_bytes() in columnar packs (LE logical value, not the memcomparable key bytes).
  • Rows INT64_MIN and 0 are returned with correct a values on TiFlash, so pack decode/display for those handles works.
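
For contrast with the key encoding, here is a small sketch of the little-endian logical form stored in packs. The helper names are hypothetical. The point it illustrates is that LE bytes roundtrip the value but do not sort byte-wise in numeric order, so any comparison on pack handles presumably has to go through the decoded i64.

```rust
/// Hypothetical helper: logical handle value as stored in a pack (LE bytes).
fn handle_to_pack_bytes(h: i64) -> [u8; 8] {
    h.to_le_bytes()
}

/// Hypothetical helper: decode a pack handle back to its i64 value.
fn handle_from_pack_bytes(b: [u8; 8]) -> i64 {
    i64::from_le_bytes(b)
}
```

For example, 1 encodes to 01 00 .. 00 and 256 to 00 01 00 .. 00, so a byte-wise comparison would place 256 before 1.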

3. Read path: region range → scan bounds

  • TiFlash passes region KeyRanges to proxy (StorageDisaggregatedColumnar → fn_get_columnar_reader).
  • kvengine/src/read.rs update_range_handle sets start_handle / end_handle via decode_int_handle; if the region end is the next table prefix, end_handle = None (no upper bound).
  • ColumnarMvccReader drops rows with handle >= end_int_handle when end_int_handle is set (half-open interval). If any caller passes Some(INT64_MAX) as the handle upper bound, the MAX row itself is excluded. Pushed ranges / cop ranges that use encode_row_key(table_id, i64::MAX) as an exclusive end key are therefore worth auditing.

For the reproduced cluster, test.t_1 had a single region and full-table scan still returned only 2 rows, so the primary issue is not a per-query region end decode alone.
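
The half-open range semantics described in this section can be sketched as follows. This is an illustrative stand-in and not the actual ColumnarMvccReader code; the function name and signature are assumptions.

```rust
/// Hypothetical half-open range check: `start` is inclusive, `end_exclusive`
/// is exclusive, and `None` means "no bound on that side".
fn in_handle_range(handle: i64, start: Option<i64>, end_exclusive: Option<i64>) -> bool {
    let above_start = start.map_or(true, |s| handle >= s);
    let below_end = end_exclusive.map_or(true, |e| handle < e);
    above_start && below_end
}
```

With end_exclusive = None the row at i64::MAX passes the check, while Some(i64::MAX) rejects it, because no i64 value can express "one past MAX" as an exclusive bound.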

4. High-suspect root cause: handle index sentinel at i64::MAX

In contrib/tiflash-proxy-columnar/components/kvengine/src/table/columnar/builder.rs, finish_table appends a sentinel handle to handle_index so pack seek can load the last pack:

// if the handle is already i64::MAX, there is no next handle.
if next_handle != i64::MAX {
    self.handle_builder.handle_index.push((next_handle + 1).to_le_bytes().to_vec());
}

When the table's maximum clustered PK is 9223372036854775807, MAX + 1 would overflow, so the guard skips the push and no sentinel is added. HandleIndex::search_pack_idx + ColumnarTableReader::seek may then fail to read the last pack's row. This matches query.test, which explicitly inserts INT64_MAX as a PK.
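
To make the suspicion concrete, here is a toy model of an index that relies on a trailing sentinel. It is only an assumption about the general shape of the problem: the real HandleIndex layout, the real search_pack_idx logic, and the pack boundaries in the reproduced table are not known from this report, and the two-pack layout in the usage note is invented for illustration.

```rust
/// Toy model: `index` holds the first handle of each pack, normally followed
/// by a sentinel equal to `max_handle + 1`.
fn build_index(pack_first_handles: &[i64], max_handle: i64) -> Vec<i64> {
    let mut index = pack_first_handles.to_vec();
    // Same guard as the quoted snippet: no sentinel when +1 would overflow.
    if max_handle != i64::MAX {
        index.push(max_handle + 1);
    }
    index
}

/// Hypothetical lookup that needs both boundaries of a pack:
/// pack `i` covers `index[i] <= handle < index[i + 1]`.
fn search_pack_idx(index: &[i64], handle: i64) -> Option<usize> {
    index
        .windows(2)
        .position(|w| w[0] <= handle && handle < w[1])
}
```

In this model, packs starting at i64::MIN and 5 with a maximum handle of i64::MAX produce an index with no sentinel. Lookups for i64::MIN and 0 still resolve to pack 0, but the lookup for i64::MAX finds no pack, which is the same symptom as the missing row.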

5. Write path

  • Columnar L2 is built via RowToColumnarReader + ColumnarTableBuilder (compaction.rs transform_for_columnar / convert_row_file_to_columnar_file), with set_unbounded_handle_range() for ingest.

Fix direction (proposal):

  • In builder.rs, when max_handle == i64::MAX, append a non-overflowing upper sentinel for handle_index (e.g. a byte sequence that sorts after all row handles in memcomparable order, or a dedicated table-end marker used only for index seek).
  • Add a regression test: three rows (INT64_MIN, 0, INT64_MAX) — columnar build + read via disaggregated cop / read_block.
  • Audit set_int_handle_range / end_int_handle so exclusive end semantics do not drop the MAX row when the cop range end key is encode_row_key(t, i64::MAX).
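
One possible shape for a non-overflowing upper sentinel is an explicit "unbounded" marker instead of max_handle + 1. The sketch below is a proposal-level illustration under that assumption; UpperBound, upper_bound_for, and last_pack_contains are hypothetical names and do not exist in kvengine.

```rust
/// Hypothetical upper boundary for the last pack of a handle index.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UpperBound {
    /// Handles strictly below this value belong to the last pack.
    Exclusive(i64),
    /// The maximum handle is i64::MAX, so nothing can sort after it.
    Unbounded,
}

/// Build the boundary without overflowing at i64::MAX.
fn upper_bound_for(max_handle: i64) -> UpperBound {
    match max_handle.checked_add(1) {
        Some(next) => UpperBound::Exclusive(next),
        None => UpperBound::Unbounded,
    }
}

/// Check whether `handle` falls in a last pack that starts at `first_handle`.
fn last_pack_contains(first_handle: i64, bound: UpperBound, handle: i64) -> bool {
    let below_end = match bound {
        UpperBound::Exclusive(end) => handle < end,
        UpperBound::Unbounded => true,
    };
    handle >= first_handle && below_end
}
```

A regression test along the lines proposed above would feed the three handles INT64_MIN, 0, and INT64_MAX through the build and read paths and assert that all three come back.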

Log analysis (mycli repro without DROP TABLE)

The log line "Finished reading remote snapshot through proxy, rows=N" (emitted from the RNProxyInputStream destructor in StorageDisaggregatedColumnar.cpp) counts rows delivered from the proxy before schema projection (AddExtraTableIDColumnTransformAction::totalRows()).

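| Query         | Finished reading proxy snapshots | Finished reading remote snapshot through proxy | MPP TableFullScan outbound_rows |
|---------------|----------------------------------|------------------------------------------------|---------------------------------|
| a > INT64_MIN | rows=1                           | rows=1                                         | 1                               |
| Full table    | rows=2                           | rows=2                                         | 2                               |
| COUNT(*)      | rows=2                           | rows=2                                         | 2                               |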

Conclusions from logs:

  • Bug is at proxy → TiFlash boundary (MPP outbound_rows equals proxy rows), not post-projection loss.
  • Unlike enum-PK columnar bugs (rows=2 with wrong values), this is row loss (TiKV 3 vs TiFlash 2 on full scan).
  • Proxy snapshot metadata: columnar_creates.biggest ends with ...FFFFFFFFFFFFFF, but read_block returns only 2 rows — metadata upper bound vs readable rows mismatch (supports handle-index / last-pack read issue at i64::MAX).
  • filter_conditions: [] on proxy reads — predicate not pushed to columnar filter; row count difference is from scan range + stored data.

Example TiFlash log (line 37):

[INFO] Finished reading proxy snapshots, rows=1 cost=0.001s
[INFO] Finished reading remote snapshot through proxy, rows=1 bytes=13 read_cost=0.001s deserialize_cost=0.000s
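
When comparing row counts across stages, a small helper can pull the rows=N value out of log lines like the ones above. This is a convenience sketch only; parse_rows is a hypothetical name and the format is assumed to match the two example lines.

```rust
/// Extract the integer that follows the first `rows=` in a log line.
/// Returns `None` when the marker is absent or not followed by digits.
fn parse_rows(line: &str) -> Option<u64> {
    let marker = "rows=";
    let rest = &line[line.find(marker)? + marker.len()..];
    let digits: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    digits.parse().ok()
}
```

Applied to both example lines, it yields 1, matching the single row returned for the query at line 37.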

4. What is your TiFlash version? (Required)

  • Environment: Next-gen / disaggregated with columnar (ENABLE_NEXT_GEN=1, ENABLE_NEXT_GEN_COLUMNAR=1), tiflash-proxy-columnar + cloud-storage-engine (tikv-server / tikv-worker).
  • Affected component: kvengine columnar (columnar/builder.rs handle index, columnar/reader.rs seek/MVCC range) → TiFlash StorageDisaggregatedColumnar / RNProxyInputStream (not classic DeltaMerge storage).
  • Reference test: tests/fullstack-test2/clustered_index/query.test line 37.

Note for reviewers: After a proxy fix, rebuild TiFlash (proxy is linked into the columnar build), restart TiFlash, and rebuild or re-compact columnar files for the table before re-running SQL verification.
