perf: Optimize some decimal expressions #3619
Conversation
Replace the 4-node expression tree (Cast→BinaryExpr→Cast→Cast) used for Decimal128 arithmetic that may overflow with a single fused expression that performs i256 register arithmetic directly. This reduces per-batch allocation from 4 intermediate arrays (112 bytes/elem) to 1 output array (16 bytes/elem). The new WideDecimalBinaryExpr evaluates children, performs add/sub/mul using i256 intermediates via try_binary, applies scale adjustment with HALF_UP rounding, checks precision bounds, and outputs a single Decimal128 array. Follows the same pattern as decimal_div.
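The fused evaluation described above can be sketched in pure std Rust. The real implementation widens `Decimal128` (`i128`) into arrow's `i256`; since `i256` lives in `arrow_buffer`, this hypothetical sketch instead widens `i64` into `i128` to show the same fused flow: multiply in a wide register, rescale with HALF_UP rounding, then check the precision bound. All names here are illustrative, not the PR's actual API.

```rust
/// Hypothetical sketch of the fused multiply path.
/// `i64 -> i128` widening stands in for `Decimal128 -> i256`.
fn fused_decimal_mul(
    l: i64,                // lhs unscaled value
    r: i64,                // rhs unscaled value
    scale_delta: u32,      // digits to drop: (s_lhs + s_rhs) - s_out
    precision_bound: i128, // 10^p - 1 for the output precision
) -> Option<i128> {
    // 1. Multiply in the wide register: cannot overflow i128 for i64 inputs.
    let wide = (l as i128) * (r as i128);
    // 2. Rescale with HALF_UP rounding (add half the divisor, away from zero).
    let divisor = 10i128.checked_pow(scale_delta)?;
    let half = divisor / 2;
    let adjusted = if wide < 0 { wide - half } else { wide + half };
    let scaled = adjusted / divisor; // Rust integer division truncates toward zero
    // 3. Precision check: caller maps None to an error or a null.
    if scaled.abs() > precision_bound {
        None
    } else {
        Some(scaled)
    }
}
```

For example, `1.25 * 1.25` with two digits dropped yields the unscaled value `156` (i.e. `1.56`), with no intermediate arrays materialized between the three steps.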
Add benchmark comparing old Cast->BinaryExpr->Cast chain vs fused WideDecimalBinaryExpr for Decimal128 add/sub/mul. Covers four cases: add with same scale, add with different scales, multiply, and subtract.
Force-pushed from cb52636 to d7495bd.
@sqlbenchmark run tpch --iterations 3
Eliminate redundant CheckOverflow when wrapping WideDecimalBinaryExpr (which already handles overflow). Fuse Cast(Decimal128→Decimal128) + CheckOverflow into a single DecimalRescaleCheckOverflow expression that rescales and validates precision in one pass.
Force-pushed from 5a21500 to 91092a6.
@sqlbenchmark run tpch --iterations 3
Benchmark job
This is awesome!
```rust
let rescale_divisor = if need_rescale {
    i256_pow10((max_scale - s_out) as u32)
} else {
    i256::ONE
```
Don't we need to scale up to s_out if s_out > max_scale?
```rust
ColumnarValue::Scalar(ScalarValue::Decimal128(v, _precision, _scale)) => {
    let new_v = v.and_then(|val| {
        rescale_and_check(val, delta, scale_factor, bound, fail_on_error)
            .ok()
```
If fail_on_error=true, then this `.ok()` will convert the `Err` to `None` and silently hide the error.
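The pitfall flagged here can be reproduced in a minimal pure-std sketch. `rescale_and_check` below is a hypothetical stand-in with a simplified signature, not the PR's function; the point is that `.ok()` inside `and_then` collapses `Err` into `None`, whereas `map(...).transpose()` lets the error escape the `Option`.

```rust
// Hypothetical stand-in for rescale_and_check: fails when out of bounds.
fn rescale_and_check(v: i128, bound: i128) -> Result<i128, String> {
    if v.abs() > bound {
        Err(format!("value {v} exceeds bound {bound}"))
    } else {
        Ok(v)
    }
}

// Pattern from the diff: `.ok()` turns Err into None, so even with
// fail_on_error=true the caller sees a null instead of an error.
fn swallowed(v: Option<i128>, bound: i128) -> Option<i128> {
    v.and_then(|val| rescale_and_check(val, bound).ok())
}

// Propagating alternative: transpose lifts the Result out of the Option,
// so Option<Result<T, E>> becomes Result<Option<T>, E>.
fn propagated(v: Option<i128>, bound: i128) -> Result<Option<i128>, String> {
    v.map(|val| rescale_and_check(val, bound)).transpose()
}
```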
```rust
fn evaluate(&self, batch: &RecordBatch) -> datafusion::common::Result<ColumnarValue> {
    let arg = self.child.evaluate(batch)?;
    let delta = self.output_scale - self.input_scale;
```
The scales could be negative. Say output_scale=38 (the maximum supported one) and input_scale=-1; then delta=39.
This will overflow below at `10i128.pow(abs_delta as u32)`.
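A sketch of the guard this comment is asking for, using only std: scales are signed and may be negative, so the delta can exceed 38 even when both scales are individually valid, and `checked_pow` turns the would-be panic into a recoverable `None`. The function name is hypothetical.

```rust
// Decimal scales are i8 in Arrow and may be negative, so the rescale
// delta can exceed 38 (e.g. 38 - (-1) = 39) even with valid inputs.
fn checked_scale_factor(input_scale: i8, output_scale: i8) -> Option<i128> {
    let delta = output_scale as i32 - input_scale as i32;
    let abs_delta = delta.unsigned_abs();
    // 10^39 > i128::MAX (~1.7e38): checked_pow returns None instead of panicking.
    10i128.checked_pow(abs_delta)
}
```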
```rust
/// Maximum absolute value for a given decimal precision: 10^p - 1.
#[inline]
fn precision_bound(precision: u8) -> i128 {
    10i128.pow(precision as u32) - 1
```
Maybe add some validation that precision is <= 38?
Or use `checked_pow()` and return a `Result`.
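One way to combine both of the review's options in a hedged sketch (error type and messages are illustrative, not the project's): validate the Decimal128 precision limit up front, and still use `checked_pow` as a belt-and-braces fallback.

```rust
/// Hardened variant: Decimal128 precision is at most 38, and checked_pow
/// guards against any remaining overflow instead of panicking.
fn precision_bound(precision: u8) -> Result<i128, String> {
    if precision > 38 {
        return Err(format!("Decimal128 precision {precision} exceeds 38"));
    }
    10i128
        .checked_pow(precision as u32)
        .map(|p| p - 1)
        .ok_or_else(|| "10^precision overflowed i128".to_string())
}
```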
```rust
/// Compute 10^exp as i256.
#[inline]
fn i256_pow10(exp: u32) -> i256 {
```
Extreme case: `exp > 77` will overflow.
```rust
DataFusionOperator::Plus | DataFusionOperator::Minus | DataFusionOperator::Multiply,
Ok(DataType::Decimal128(p1, s1)),
Ok(DataType::Decimal128(p2, s2)),
Ok(DataType::Decimal128(_p1, _s1)),
```
What is the reason to use the `_` prefix? The variables are used.
```rust
// divisor = 10^(-delta), half = divisor / 2
let divisor = scale_factor; // already 10^abs(delta)
let half = divisor / 2;
let sign = if value < 0 { -1i128 } else { 1i128 };
```
Suggested change:

```diff
-let sign = if value < 0 { -1i128 } else { 1i128 };
+let sign = value.signum();
```
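One subtlety worth noting about this suggestion: `signum()` returns 0 for 0, while the hand-rolled branch returns 1. For the HALF_UP rounding shown above the two still agree, because the half-adjustment of zero is zero either way (`half < divisor`, so `half / divisor == 0`). A minimal pure-std sketch of the rounding with `signum` (function name hypothetical):

```rust
/// HALF_UP rounding: add half the divisor away from zero, then truncate.
/// signum() gives 0 for value == 0, which still yields 0 after division.
fn round_half_up(value: i128, divisor: i128) -> i128 {
    let half = divisor / 2;
    (value + value.signum() * half) / divisor
}
```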
```rust
}

// Fuse Cast(Decimal128→Decimal128) + CheckOverflow into single rescale+check
if let Some(cast) = child.as_any().downcast_ref::<Cast>() {
```
Should this also check that the Cast's precision/scale match the output's p/s before fusing?
```rust
fn with_new_children(
    self: Arc<Self>,
    children: Vec<Arc<dyn PhysicalExpr>>,
) -> datafusion::common::Result<Arc<dyn PhysicalExpr>> {
```
Suggested change:

```diff
 ) -> datafusion::common::Result<Arc<dyn PhysicalExpr>> {
+    if children.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "DecimalRescaleCheckOverflow expects 1 child, got {}",
+            children.len()
+        )));
+    }
```
```rust
fn with_new_children(
    self: Arc<Self>,
    children: Vec<Arc<dyn PhysicalExpr>>,
) -> Result<Arc<dyn PhysicalExpr>> {
```
Suggested change:

```diff
 ) -> Result<Arc<dyn PhysicalExpr>> {
+    if children.len() != 2 {
+        return Err(DataFusionError::Internal(format!(
+            "WideDecimalBinaryExpr expects 2 children, got {}",
+            children.len()
+        )));
+    }
```
Summary
For some decimal operations, we cast the inputs to Decimal256, perform the math operation, and then cast the result back to Decimal128.
This causes intermediate allocations and can be optimized with a custom `WideDecimalBinaryExpr` that performs i256 register arithmetic directly, reducing per-batch allocation from 4 intermediate arrays (3 Decimal256 @ 32 bytes/elem + 1 Decimal128 @ 16 bytes/elem = 112 bytes/elem) to 1 output array (16 bytes/elem).

TPC-H q1
Before:
After:
Criterion benchmark results (8192 element batches)
How it works
`WideDecimalBinaryExpr` evaluates left/right children, performs add/sub/mul using `i256` intermediates via `arrow::compute::kernels::arity::try_binary`, applies scale adjustment with HALF_UP rounding, checks precision bounds, and outputs a single `Decimal128` array. Follows the same pattern as `decimal_div` in `div.rs`.

Overflow handling matches existing behavior:

- `ArrowError::ComputeError`
- `i128::MAX` sentinel + `null_if_overflow_precision`
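The two overflow modes listed above can be sketched in pure std Rust. This is an illustrative model, not the PR's code: `Err(String)` stands in for `ArrowError::ComputeError`, and the `i128::MAX` sentinel models the value that a later `null_if_overflow_precision`-style pass would turn into a null.

```rust
/// Sketch of the two overflow behaviors (names and error type hypothetical):
/// fail mode surfaces an error; non-fail mode writes an i128::MAX sentinel
/// that downstream code converts to null.
fn handle_overflow(value: i128, bound: i128, fail_on_error: bool) -> Result<i128, String> {
    if value.abs() > bound {
        if fail_on_error {
            Err(format!("Decimal overflow: {value}")) // models ArrowError::ComputeError
        } else {
            Ok(i128::MAX) // sentinel, nulled downstream
        }
    } else {
        Ok(value)
    }
}
```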