Flaky test: push_down_filter_regression.slt dynamic filter content is non-deterministic #22621

@diegoQuinas

Describe the bug

The sqllogictest push_down_filter_regression.slt (added in #22150) is flaky in CI. The DynamicFilter content asserted by the EXPLAIN ANALYZE queries on agg_dyn_single is not deterministic, contrary to what the test's own comment claims.

A recent CI run failed with:

[SQL] EXPLAIN ANALYZE SELECT MIN(a), MAX(a) FROM agg_dyn_single;
[Diff] (-expected|+actual)
- predicate=DynamicFilter [ a@0 < 1 OR a@0 > 8 ], pruning_predicate=... a_min@0 < 1 ...
+ predicate=DynamicFilter [ a@0 < 3 OR a@0 > 8 ], pruning_predicate=... a_min@0 < 3 ...
at datafusion/sqllogictest/test_files/push_down_filter_regression.slt:330

Root cause

The test data is split across two files:

  • file_0: (5), (1); partial min = 1 (the global minimum)
  • file_1: (3), (8); partial min = 3, partial max = 8

The comment above the queries states:

Pruning metrics here are subject to a parallel-execution race (the order in which Partial aggregates publish filter updates vs. when the scan reads each partition), so the filter content is deterministic but the pruning counts are not.

That assumption is incorrect. The dynamic filter threshold tightens as each AggregateExec(mode=Partial) publishes its running min/max. EXPLAIN ANALYZE captures a snapshot of the filter's state. The same race the comment acknowledges for the pruning counts also affects the filter content: if the snapshot is taken after file_1 has published its partial min (3) but before file_0 publishes the global min (1), the filter reads a < 3 instead of the final a < 1. The MAX side already showed its final value (> 8) in the failing run because file_1 also holds the global maximum.

So the filter content is an intermediate value of a converging filter, and which value is observed depends on partition scheduling — exactly the non-determinism the comment attributes only to the counts.
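The effect of publish order can be illustrated with a small standalone model. This is only a sketch, not DataFusion code: `FILES` and `filter_snapshots` are hypothetical names, and it assumes each partial aggregate publishes its file's min and max together in a single update, which may not match the real implementation.

```python
from itertools import permutations

# Toy stand-in for the two test files (values of column `a`).
FILES = {"file_0": [5, 1], "file_1": [3, 8]}


def filter_snapshots(publish_order, files=FILES):
    """Return the filter text observable after each publish, in order.

    Assumption: every partial aggregate publishes min and max at once.
    """
    lo = hi = None
    snapshots = []
    for name in publish_order:
        values = files[name]
        # The bounds only ever widen toward the global min and max.
        lo = min(values) if lo is None else min(lo, min(values))
        hi = max(values) if hi is None else max(hi, max(values))
        snapshots.append(f"a@0 < {lo} OR a@0 > {hi}")
    return snapshots


if __name__ == "__main__":
    for order in permutations(FILES):
        print(order, filter_snapshots(order))
```

In this model every order ends at `a@0 < 1 OR a@0 > 8`, but the first snapshot differs by order, and the file_1-first order passes through the `a@0 < 3 OR a@0 > 8` value seen in the failing CI run.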

To Reproduce

Hard to reproduce deterministically because it is a thread-scheduling race; it surfaces intermittently in CI. The failing assertions are the agg_dyn_single EXPLAIN ANALYZE queries in datafusion/sqllogictest/test_files/push_down_filter_regression.slt (around line 330).

Expected behavior

The test should be stable across runs and not depend on the order in which partial aggregates publish their filter updates.

Additional context

Possible directions (open to maintainer preference):

  1. Assert only on the shape of the dynamic filter (e.g. that a DynamicFilter is present with the right column/structure) rather than its converged threshold value.
  2. Force a single partition / deterministic scan order for these specific queries so the filter is guaranteed to be fully converged at snapshot time.
  3. Use data where every file shares the same per-file min/max so any intermediate snapshot equals the final value.
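As a rough illustration of direction 1, the comparison could mask the numeric thresholds and check only the filter's structure. The helper below is a hypothetical standalone sketch (`filter_shape` is not part of DataFusion or sqllogictest), and whether the slt runner can apply this kind of normalization is not something this sketch addresses.

```python
import re

# Captures the text between "DynamicFilter [ " and the first " ]".
_DYNAMIC_FILTER = re.compile(r"DynamicFilter \[ (.*?) \]")
# Numeric literals, skipping column indices such as the 0 in "a@0".
_LITERAL = re.compile(r"(?<![@\w])-?\d+(?:\.\d+)?")


def filter_shape(plan_line):
    """Return the DynamicFilter expression with literals masked as '?'.

    Returns None when the line contains no DynamicFilter.
    """
    match = _DYNAMIC_FILTER.search(plan_line)
    if match is None:
        return None
    return _LITERAL.sub("?", match.group(1))


if __name__ == "__main__":
    expected = "predicate=DynamicFilter [ a@0 < 1 OR a@0 > 8 ], pruning_predicate=..."
    actual = "predicate=DynamicFilter [ a@0 < 3 OR a@0 > 8 ], pruning_predicate=..."
    print(filter_shape(expected) == filter_shape(actual))
```

With this masking, the expected and actual lines from the failing run compare equal, while a plan with a missing or differently structured DynamicFilter would still be flagged.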

Introduced in #22150. Happy to open a PR once there's agreement on the preferred approach.
