Next-gen columnar: Columnar disaggregated read crashes on generated columns #10856

@JaySon-Huang

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Reproduce in the next-gen columnar fullstack test environment (requires use_columnar = true):

cd tests/fullstack-test-next-gen
./compose.sh up -d

# Run the test inside the tiflash-cn0 container
./compose.sh exec -T tiflash-cn0 bash -c \
  'cd /tests && ENABLE_NEXT_GEN=true ./run-test.sh fullstack-test/expr/generated_columns2.test'

The failure occurs at line 42 of the test file:

set tidb_isolation_read_engines='tiflash';
select a, b, c, d, hour(t) from test.t where t = '000:10:10.123456';

Table schema (setup steps in generated_columns2.test):

  • a: regular column
  • b: int as (a+1) virtual (generated column)
  • c: regular column
  • d: int as (c+1) virtual (generated column)
  • t: time(6) regular column

In the same test file, analyze table test.t (HashAgg + full table columnar scan) succeeds; the SELECT with WHERE t = ... fails.

2. What did you expect to see? (Required)

The query should return one row:

+------+------+------+------+---------+
| a    | b    | c    | d    | hour(t) |
+------+------+------+------+---------+
|    1 |    2 |    2 |    3 |       0 |
+------+------+------+------+---------+

3. What did you see instead (Required)

The TiFlash MPP task fails. On builds without RUNTIME_CHECK, the client reports:

ERROR 1105 (HY000): other error for mpp stream: std::exception.
Code: 1001, type: std::bad_alloc, e.what() = std::bad_alloc

After adding debug logs and a RUNTIME_CHECK, a DEBUG build on a clean cluster fails consistently (~5s, not a timeout) with:

DB::Exception: Check static_cast<size_t>(i) < column_defines->size() failed:
column_defines index out of range when filtering generated columns,
table_scan_idx=4, column_defines_size=3, table_scan_size=5

Typical TiFlash log sequence for the failing SELECT (local_query_id:3 or similar):

  • readThroughColumnar(pipeline) begin
  • fn_get_columnar_reader done region_id=125, error_type=0 (proxy reader created successfully)
  • genColumnDefinesForDisaggregatedRead done, num_columns=3, extra_table_id_index=-10000
  • before filter generated columns from column_defines
  • then std::bad_alloc or the RUNTIME_CHECK failure above
  • the "Begin reading proxy snapshots" log line never appears (the pipeline fails during compile/init, task_start_timestamp=0)

For comparison, a successful analyze table:

  • genColumnDefinesForDisaggregatedRead done, num_columns=1
  • takes the no-generated-column branch and does not enter the generated-column filter loop
  • readThroughColumnar(pipeline) end, full scan reads 12288 rows successfully

4. What is your TiFlash version? (Required)

Local development build (next-gen columnar path, ENABLE_NEXT_GEN_COLUMNAR=1):

  • Build directory: cmake-build-debug-ng
  • Test binary: tests/.build/tiflash/tiflash
  • Proxy: contrib/tiflash-proxy-columnar (libtiflash_proxy.so)

Current Analysis (Root Cause and Impact)

Root cause

The bug is in the second-pass generated-column filtering inside genColumnDefinesForDisaggregatedReadThroughColumnar() in StorageDisaggregatedColumnar.cpp.

genColumnDefinesForDisaggregatedRead() (GenSchemaAndColumn.cpp) already skips generated columns when building column_defines, keeping only physical columns:

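+------------------+-------------+-------------------------+
| table_scan index | column      | index in column_defines |
+------------------+-------------+-------------------------+
| 0                | a           | 0                       |
| 1                | b (virtual) | (not present)           |
| 2                | c           | 1                       |
| 3                | d (virtual) | (not present)           |
| 4                | t           | 2                       |
+------------------+-------------+-------------------------+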

So column_defines->size() == 3 while table_scan.getColumnSize() == 5.
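The size mismatch can be reproduced outside TiFlash with a minimal sketch. ScanColumn, physicalColumnDefines, and exampleScan are hypothetical stand-ins, not TiFlash types or functions; the sketch only mimics the behavior described above, where the first pass keeps physical columns and drops generated ones.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one table_scan column; not the TiFlash type.
struct ScanColumn
{
    std::string name;
    bool is_generated;
};

// Mimics the first pass as described in this issue: only physical
// (non-generated) columns end up in the resulting column list.
std::vector<std::string> physicalColumnDefines(const std::vector<ScanColumn> & scan)
{
    std::vector<std::string> defines;
    for (const auto & col : scan)
    {
        if (!col.is_generated)
            defines.push_back(col.name);
    }
    return defines;
}

// The five columns of test.t, in table_scan order.
std::vector<ScanColumn> exampleScan()
{
    return {{"a", false}, {"b", true}, {"c", false}, {"d", true}, {"t", false}};
}
```

Calling physicalColumnDefines(exampleScan()) yields three entries (a, c, t) for five table_scan columns, so any index that is valid for table_scan is not necessarily valid for column_defines.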

The subsequent filter loop incorrectly uses the table_scan index i to access (*column_defines)[i]:

for (Int32 i = 0; i < table_scan.getColumnSize(); ++i)
{
    if (table_scan.getColumns()[i].hasGeneratedColumnFlag())
        continue;
    filtered_column_defines->push_back((*column_defines)[i]);  // BUG: i is a table_scan index, not a column_defines index
}

Consequences:

  1. At i=2, columns are misaligned: should push c (column_defines[1]), but actually pushes t (column_defines[2])
  2. At i=4, out-of-bounds access: column_defines[4] does not exist → undefined behavior, surfaced as std::bad_alloc or RUNTIME_CHECK failure

This filter loop runs only when has_generated_column == true. Queries such as analyze that do not hit this path are unaffected.
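To make the two consequences concrete, the loop's indexing can be replayed without touching any real container. This is an illustrative sketch only: Access and traceBuggyAccesses are hypothetical names, and the generated-column flags are passed in as a plain vector of bools rather than read from table_scan.

```cpp
#include <cstddef>
#include <vector>

// One access the buggy loop would perform on column_defines.
struct Access
{
    std::size_t used_idx;    // index the loop uses (the table_scan index i)
    std::size_t correct_idx; // position of the matching entry in column_defines
    bool in_range;           // whether used_idx < column_defines size
};

// Replays the loop's index arithmetic only; nothing is read out of bounds here.
std::vector<Access> traceBuggyAccesses(
    const std::vector<bool> & is_generated,
    std::size_t column_defines_size)
{
    std::vector<Access> trace;
    std::size_t physical_seen = 0; // physical columns encountered so far
    for (std::size_t i = 0; i < is_generated.size(); ++i)
    {
        if (is_generated[i])
            continue;
        trace.push_back({i, physical_seen, i < column_defines_size});
        ++physical_seen;
    }
    return trace;
}
```

For the flags {false, true, false, true, false} and a column_defines size of 3, the trace has three entries: index 0 is used correctly, index 2 is used where 1 was needed (misalignment), and index 4 is used where 2 was needed and is out of range.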

Why it manifests as bad_alloc

The out-of-bounds access is undefined behavior: it can corrupt heap metadata or make push_back copy a garbage element, which surfaces as std::bad_alloc during MPP task initialization instead of a clear index error. With the RUNTIME_CHECK added, the task fails fast before the out-of-bounds access happens.

Relationship to proxy filter pushdown

Investigation also showed proxy logs with filter_conditions: [] in make_columnar_reader, even though the TiFlash plan contains an EQDuration predicate. That is a separate issue (filter pushdown) and not the direct cause of this crash. After fixing the index bug, the pipeline should build, but whether rows are filtered correctly still needs follow-up validation.

Proposed fix

Option A (recommended): Remove the redundant generated-column filter loop in genColumnDefinesForDisaggregatedReadThroughColumnar(). genColumnDefinesForDisaggregatedRead() already excludes generated columns; the columnar path later fills virtual columns via executeGeneratedColumnPlaceholder(), consistent with the tiflash-write disaggregated read path.

Option B: Keep the loop structure but iterate column_defines with a separate column_define_idx, and do not read extra_table_id from column_defines using table_scan indices. Under current semantics this is equivalent to copying column_defines and is unnecessary compared to Option A.
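A minimal sketch of the Option B indexing, assuming simplified inputs: the generated-column flags are a plain vector of bools and column_defines is a vector of names. filterWithSeparateIndex is a hypothetical helper, not the actual TiFlash code, and the extra_table_id handling is left out.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Walks the table_scan flags with i, but reads column_defines through its own
// cursor, which advances only when a physical column is consumed.
std::vector<std::string> filterWithSeparateIndex(
    const std::vector<bool> & is_generated,
    const std::vector<std::string> & column_defines)
{
    std::vector<std::string> filtered;
    std::size_t column_define_idx = 0;
    for (std::size_t i = 0; i < is_generated.size(); ++i)
    {
        if (is_generated[i])
            continue;
        // Fail with a clear error instead of reading past the end.
        if (column_define_idx >= column_defines.size())
            throw std::out_of_range("column_defines has fewer entries than physical columns");
        filtered.push_back(column_defines[column_define_idx]);
        ++column_define_idx;
    }
    return filtered;
}
```

With the flags {false, true, false, true, false} and column_defines {a, c, t}, the result is again {a, c, t}. In this simplified model the loop is a plain copy of its input, which is the reason Option A simply drops it.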

Related code

  • dbms/src/Storages/StorageDisaggregatedColumnar.cpp: genColumnDefinesForDisaggregatedReadThroughColumnar()
  • dbms/src/Flash/Coprocessor/GenSchemaAndColumn.cpp: genColumnDefinesForDisaggregatedRead()
  • Test: tests/fullstack-test/expr/generated_columns2.test

Reproduction notes

  • Use a clean cluster (compose down, delete data/log, then up -d). Restarting only tiflash-cn0 can leave columnar snapshot 404 errors that mask this bug.
  • Columnar read must be enabled in cluster config: use_columnar = true in tests/docker/next-gen-config/tiflash_cn.toml.
