Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
Reproduce in the next-gen columnar fullstack test environment (use_columnar = true required):
cd tests/fullstack-test-next-gen
./compose.sh up -d
# Run the test inside the tiflash-cn0 container
./compose.sh exec -T tiflash-cn0 bash -c \
'cd /tests && ENABLE_NEXT_GEN=true ./run-test.sh fullstack-test/expr/generated_columns2.test'
The failure occurs at line 42 of the test file:
set tidb_isolation_read_engines='tiflash';
select a, b, c, d, hour(t) from test.t where t = '000:10:10.123456';
Table schema (setup steps in generated_columns2.test):
a: regular column
b: int as (a+1) virtual (generated column)
c: regular column
d: int as (c+1) virtual (generated column)
t: time(6) regular column
In the same test file, analyze table test.t (HashAgg + full table columnar scan) succeeds; the SELECT with WHERE t = ... fails.
2. What did you expect to see? (Required)
The query should return one row:
+------+------+------+------+---------+
| a | b | c | d | hour(t) |
+------+------+------+------+---------+
| 1 | 2 | 2 | 3 | 0 |
+------+------+------+------+---------+
3. What did you see instead (Required)
The TiFlash MPP task fails. On builds without RUNTIME_CHECK, the client reports:
ERROR 1105 (HY000): other error for mpp stream: std::exception.
Code: 1001, type: std::bad_alloc, e.what() = std::bad_alloc
After adding debug logs and a RUNTIME_CHECK, a DEBUG build on a clean cluster fails consistently (~5s, not a timeout) with:
DB::Exception: Check static_cast<size_t>(i) < column_defines->size() failed:
column_defines index out of range when filtering generated columns,
table_scan_idx=4, column_defines_size=3, table_scan_size=5
Typical TiFlash log sequence for the failing SELECT (local_query_id:3 or similar):
- readThroughColumnar(pipeline) begin
- fn_get_columnar_reader done region_id=125, error_type=0 (proxy reader created successfully)
- genColumnDefinesForDisaggregatedRead done, num_columns=3, extra_table_id_index=-10000
- before filter generated columns from column_defines
- then std::bad_alloc or the RUNTIME_CHECK failure above
- Begin reading proxy snapshots never appears (pipeline fails during compile/init, task_start_timestamp=0)
For comparison, a successful analyze table:
- genColumnDefinesForDisaggregatedRead done, num_columns=1
- takes the no generated column branch and does not enter the generated-column filter loop
- readThroughColumnar(pipeline) end, full scan reads 12288 rows successfully
4. What is your TiFlash version? (Required)
Local development build (next-gen columnar path, ENABLE_NEXT_GEN_COLUMNAR=1):
- Build directory: cmake-build-debug-ng
- Test binary: tests/.build/tiflash/tiflash
- Proxy: contrib/tiflash-proxy-columnar (libtiflash_proxy.so)
Current Analysis (Root Cause and Impact)
Root cause
The bug is in the second-pass generated-column filtering inside genColumnDefinesForDisaggregatedReadThroughColumnar() in StorageDisaggregatedColumnar.cpp.
genColumnDefinesForDisaggregatedRead() (GenSchemaAndColumn.cpp) already skips generated columns when building column_defines, keeping only physical columns:
| table_scan index | column      | index in column_defines |
|------------------|-------------|-------------------------|
| 0                | a           | 0                       |
| 1                | b (virtual) | (not present)           |
| 2                | c           | 1                       |
| 3                | d (virtual) | (not present)           |
| 4                | t           | 2                       |
So column_defines->size() == 3 while table_scan.getColumnSize() == 5.
The subsequent filter loop incorrectly uses the table_scan index i to access (*column_defines)[i]:
for (Int32 i = 0; i < table_scan.getColumnSize(); ++i)
{
    if (table_scan.getColumns()[i].hasGeneratedColumnFlag())
        continue;
    filtered_column_defines->push_back((*column_defines)[i]); // BUG
}
Consequences:
- At i=2, columns are misaligned: should push c (column_defines[1]), but actually pushes t (column_defines[2])
- At i=4, out-of-bounds access: column_defines[4] does not exist → undefined behavior, surfaced as std::bad_alloc or RUNTIME_CHECK failure
This filter loop runs only when has_generated_column == true. Queries such as analyze that do not hit this path are unaffected.
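The index mismatch can be replayed outside TiFlash on plain vectors. The following is a minimal sketch, not the TiFlash code: ScanColumn, replayBuggyFilter and the "<out of range>" marker are hypothetical stand-ins, and column defines are reduced to their names. It uses the same indexing as the loop above, but guards the access so that the out-of-range case is reported instead of triggering undefined behavior.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one table_scan column: its name and the generated-column flag.
struct ScanColumn
{
    std::string name;
    bool is_generated;
};

// Replays the buggy loop: for every non-generated table_scan column, look up
// column_defines with the table_scan index i. Where the real code would read past
// the end of column_defines, this sketch records the marker "<out of range>" instead.
std::vector<std::string> replayBuggyFilter(
    const std::vector<ScanColumn> & table_scan,
    const std::vector<std::string> & column_defines)
{
    std::vector<std::string> pushed;
    for (std::size_t i = 0; i < table_scan.size(); ++i)
    {
        if (table_scan[i].is_generated)
            continue;
        if (i < column_defines.size())
            pushed.push_back(column_defines[i]); // same indexing as the buggy loop
        else
            pushed.push_back("<out of range>"); // the real code has undefined behavior here
    }
    return pushed;
}
```

With table_scan = {a, b (generated), c, d (generated), t} and column_defines = {a, c, t}, the function returns {"a", "t", "<out of range>"}: c is never pushed, t is pushed in its place, and the last access falls outside column_defines.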
Why it manifests as bad_alloc
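The loop only reads out of bounds, so the most likely mechanism is that push_back copies a garbage ColumnDefine (for example, one whose string length field holds garbage) and requests an impossibly large allocation; heap corruption is also possible. Either way, the failure surfaces as std::bad_alloc during MPP task initialization instead of a clear index error. The added RUNTIME_CHECK fails fast before the out-of-bounds access.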
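The fail-fast idea can be sketched in standalone C++. This is an illustration under stated assumptions, not TiFlash's RUNTIME_CHECK macro: checkedColumnDefine is a hypothetical helper, column defines are reduced to names, and the message text only mirrors the fields seen in the failure above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: validate the index before touching column_defines, so an
// out-of-range index becomes a descriptive exception rather than undefined behavior.
const std::string & checkedColumnDefine(
    const std::vector<std::string> & column_defines,
    std::size_t table_scan_idx,
    std::size_t table_scan_size)
{
    if (table_scan_idx >= column_defines.size())
    {
        throw std::out_of_range(
            "column_defines index out of range when filtering generated columns, table_scan_idx="
            + std::to_string(table_scan_idx)
            + ", column_defines_size=" + std::to_string(column_defines.size())
            + ", table_scan_size=" + std::to_string(table_scan_size));
    }
    return column_defines[table_scan_idx];
}
```

Called with column_defines = {a, c, t}, index 2 returns "t" (still the wrong column, but in bounds), while index 4 throws before any memory outside the vector is read.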
Relationship to proxy filter pushdown
Investigation also showed proxy logs with filter_conditions: [] in make_columnar_reader, even though the TiFlash plan contains an EQDuration predicate. That is a separate issue (filter pushdown) and not the direct cause of this crash. After fixing the index bug, the pipeline should build, but whether rows are filtered correctly still needs follow-up validation.
Proposed fix
Option A (recommended): Remove the redundant generated-column filter loop in genColumnDefinesForDisaggregatedReadThroughColumnar(). genColumnDefinesForDisaggregatedRead() already excludes generated columns; the columnar path later fills virtual columns via executeGeneratedColumnPlaceholder(), consistent with the tiflash-write disaggregated read path.
Option B: Keep the loop structure but iterate column_defines with a separate column_define_idx, and do not read extra_table_id from column_defines using table_scan indices. Under current semantics this is equivalent to copying column_defines and is unnecessary compared to Option A.
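For reference, Option B could look roughly like the sketch below. It is written against plain vectors rather than the TiFlash types: ScanColumn and filterWithSeparateIndex are hypothetical names, column defines are reduced to their names, and extra_table_id handling is left out entirely. It also assumes column_defines holds exactly one entry per non-generated table_scan column, in table_scan order.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one table_scan column: its name and the generated-column flag.
struct ScanColumn
{
    std::string name;
    bool is_generated;
};

// Walks table_scan, but advances a separate column_define_idx only for
// non-generated columns, so the two sequences stay aligned.
std::vector<std::string> filterWithSeparateIndex(
    const std::vector<ScanColumn> & table_scan,
    const std::vector<std::string> & column_defines)
{
    std::vector<std::string> filtered;
    filtered.reserve(column_defines.size());
    std::size_t column_define_idx = 0;
    for (const auto & col : table_scan)
    {
        if (col.is_generated)
            continue;
        // Assumed invariant: one column define per non-generated column.
        assert(column_define_idx < column_defines.size());
        filtered.push_back(column_defines[column_define_idx]);
        ++column_define_idx;
    }
    // Every column define should have been consumed exactly once.
    assert(column_define_idx == column_defines.size());
    return filtered;
}
```

For the schema in this report the result is {a, c, t}, identical to the input column_defines, which illustrates why the loop is redundant and Option A is the simpler fix.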
Related code
- dbms/src/Storages/StorageDisaggregatedColumnar.cpp: genColumnDefinesForDisaggregatedReadThroughColumnar()
- dbms/src/Flash/Coprocessor/GenSchemaAndColumn.cpp: genColumnDefinesForDisaggregatedRead()
- Test: tests/fullstack-test/expr/generated_columns2.test
Reproduction notes
- Use a clean cluster (compose down, delete data/log, then up -d). Restarting only tiflash-cn0 can leave columnar snapshot 404 errors that mask this bug.
- Columnar read must be enabled in cluster config: use_columnar = true in tests/docker/next-gen-config/tiflash_cn.toml.