[Spark] SparkWidthBucket return_type is Int32, should be Int64 to match Spark #22602

@mbutrovich

Describe the bug

SparkWidthBucket::return_type returns Int32, but Spark's WidthBucket.dataType is LongType:

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L1825

// datafusion/spark/src/function/math/width_bucket.rs
fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
    Ok(Int32)
}

The n_bucket input was aligned to i64 to match Spark in #20330, but the return type was left as Int32. The kernel still builds Int32Array.

This produces wrong results in any consumer that plans against Spark's declared output type (Int64) but receives an Int32Array at runtime. With two rows per batch, the consumer reads 16 bytes (two Int64 values) from an 8-byte Int32 buffer: the first read packs both int32 values into a single int64, and the second read consumes the uninitialized bytes that follow.
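The misread can be reproduced without Arrow. The following is a minimal sketch, assuming a little-endian value buffer (as in Arrow's layout); `reinterpret_i32_as_i64` is a hypothetical helper written for illustration, not part of DataFusion or Comet. It only models the first 8 bytes of each pair, since the bytes past the end of the buffer are undefined.

```rust
/// Reinterpret a little-endian i32 buffer as i64 values, the way a consumer
/// that expects Int64 would read it. Trailing bytes that do not fill a whole
/// i64 are dropped here; a real consumer would instead read past the valid data.
fn reinterpret_i32_as_i64(values: &[i32]) -> Vec<i64> {
    // Lay the i32 values out as raw little-endian bytes, 4 bytes each.
    let bytes: Vec<u8> = values.iter().flat_map(|v| v.to_le_bytes()).collect();

    // Read the same bytes back in 8-byte steps, as an Int64 reader would.
    bytes
        .chunks_exact(8)
        .map(|chunk| {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(chunk);
            i64::from_le_bytes(buf)
        })
        .collect()
}
```

Calling `reinterpret_i32_as_i64(&[1, 1])` yields `[4294967297]` and `reinterpret_i32_as_i64(&[2, 2])` yields `[8589934594]`, which are the values reported for rows 0 and 2.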

Concretely, for width_bucket(value, 0.0, 10.0, 5) over Range(0, 10) split into 5 partitions of 2 rows each:

| value | expected (Int64) | observed |
|-------|------------------|----------|
| 0 | 1 | 4294967297 (= 0x1_00000001) |
| 1 | 1 | 0 |
| 2 | 2 | 8589934594 (= 0x2_00000002) |
| 3 | 2 | 0 |
| ... | ... | ... |
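For reference, the expected column can be derived by hand. This is a simplified sketch of the bucket computation for an ascending range only; the `width_bucket` and `expected_buckets` functions below are hypothetical helpers for illustration and omit the NULL, NaN, and descending-range handling of the real implementations.

```rust
/// Simplified width_bucket for an ascending range (min < max) and n_bucket > 0.
/// Returns i64 to mirror Spark's LongType result.
fn width_bucket(value: f64, min: f64, max: f64, n_bucket: i64) -> i64 {
    if value < min {
        // Below the range: underflow bucket.
        0
    } else if value >= max {
        // At or above the upper bound: overflow bucket.
        n_bucket + 1
    } else {
        // Equal-width buckets, numbered from 1.
        ((n_bucket as f64) * (value - min) / (max - min)).floor() as i64 + 1
    }
}

/// Buckets for width_bucket(value, 0.0, 10.0, 5) over the values 0..10.
fn expected_buckets() -> Vec<i64> {
    (0..10).map(|v| width_bucket(v as f64, 0.0, 10.0, 5)).collect()
}
```

`expected_buckets()` returns `[1, 1, 2, 2, 3, 3, 4, 4, 5, 5]`, so each 2-row partition holds a pair of equal bucket numbers, which is why the packed values have identical high and low halves.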

To Reproduce

Run any consumer that respects Spark's declared LongType for WidthBucket against SparkWidthBucket. The bug reproduces in DataFusion Comet with the "width_bucket - with range data" test in CometMathExpressionSuite (apache/datafusion-comet#4347).

Expected behavior

SparkWidthBucket::return_type returns Int64 and the kernel builds Int64Array, matching Spark.
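The invariant being asked for is that the declared type and the produced array agree. A rough sketch of that check follows, using stand-in types: `DataType` and `Column` here are simplified local enums, not the Arrow types, and `return_type` and `kernel` are hypothetical free functions rather than the actual `SparkWidthBucket` methods.

```rust
/// Stand-in for the Arrow data type enum (only the two variants involved).
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum DataType {
    Int32,
    Int64,
}

/// Stand-in for a typed Arrow array.
#[allow(dead_code)]
enum Column {
    Int32(Vec<i32>),
    Int64(Vec<i64>),
}

impl Column {
    /// The type the runtime data actually has.
    fn data_type(&self) -> DataType {
        match self {
            Column::Int32(_) => DataType::Int32,
            Column::Int64(_) => DataType::Int64,
        }
    }
}

/// Declared output type after the fix: Int64, matching Spark's LongType.
fn return_type() -> DataType {
    DataType::Int64
}

/// Kernel after the fix: builds 64-bit values, so it agrees with return_type().
fn kernel(buckets: &[i64]) -> Column {
    Column::Int64(buckets.to_vec())
}
```

A regression test along these lines would assert that `kernel(..).data_type() == return_type()`, so a future mismatch between the declared and produced types fails at test time instead of corrupting downstream reads.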

Additional context

Related: #20330 (input parameter alignment).

Labels: bug, spark