Skip to content

perf: Optimize some decimal expressions#3619

Open
andygrove wants to merge 4 commits intoapache:mainfrom
andygrove:wide-decimal-binary-expr
Open

perf: Optimize some decimal expressions#3619
andygrove wants to merge 4 commits intoapache:mainfrom
andygrove:wide-decimal-binary-expr

Conversation

@andygrove
Copy link
Member

@andygrove andygrove commented Mar 2, 2026

Summary

For some decimal operations, we cast the inputs to Decimal256, perform the math operation, and then cast the result back to Decimal128:

let left = Arc::new(Cast::new(
    left,
    DataType::Decimal256(p1, s1),
    SparkCastOptions::new_without_timezone(EvalMode::Legacy, false),
));
let right = Arc::new(Cast::new(
    right,
    DataType::Decimal256(p2, s2),
    SparkCastOptions::new_without_timezone(EvalMode::Legacy, false),
));
let child = Arc::new(BinaryExpr::new(left, op, right));
Ok(Arc::new(Cast::new(
    child,
    data_type,
    SparkCastOptions::new_without_timezone(EvalMode::Legacy, false),
)))

This causes intermediate allocations and can be optimized with a custom WideDecimalBinaryExpr that performs i256 register arithmetic directly, reducing per-batch allocation from 4 intermediate arrays (3 Decimal256 @ 32 bytes/elem + 1 Decimal128 @ 16 bytes/elem = 112 bytes/elem) to 1 output array (16 bytes/elem)

TPC-H q1

Before:

        "durations": [
            8.183,
            4.728,
            4.597,
            4.562,
            4.538
        ],

After:

        "durations": [
            7.874,
            4.327,
            4.185,
            4.135,
            4.069
        ],

Criterion benchmark results (8192 element batches)

Case Old Fused Speedup
add (same scale) 171 µs 57 µs 3.0x
add (diff scale) 173 µs 57 µs 3.0x
multiply 361 µs 305 µs 1.2x
subtract 173 µs 58 µs 3.0x

How it works

WideDecimalBinaryExpr evaluates left/right children, performs add/sub/mul using i256 intermediates via arrow::compute::kernels::arity::try_binary, applies scale adjustment with HALF_UP rounding, checks precision bounds, and outputs a single Decimal128 array. Follows the same pattern as decimal_div in div.rs.

Overflow handling matches existing behavior:

  • Ansi mode: returns ArrowError::ComputeError
  • Legacy/Try mode: uses i128::MAX sentinel + null_if_overflow_precision

Replace the 4-node expression tree (Cast→BinaryExpr→Cast→Cast) used for
Decimal128 arithmetic that may overflow with a single fused expression
that performs i256 register arithmetic directly. This reduces per-batch
allocation from 4 intermediate arrays (112 bytes/elem) to 1 output array
(16 bytes/elem).

The new WideDecimalBinaryExpr evaluates children, performs add/sub/mul
using i256 intermediates via try_binary, applies scale adjustment with
HALF_UP rounding, checks precision bounds, and outputs a single
Decimal128 array. Follows the same pattern as decimal_div.
Add benchmark comparing old Cast->BinaryExpr->Cast chain vs fused
WideDecimalBinaryExpr for Decimal128 add/sub/mul. Covers four cases:
add with same scale, add with different scales, multiply, and subtract.
@andygrove andygrove force-pushed the wide-decimal-binary-expr branch from cb52636 to d7495bd Compare March 2, 2026 17:55
@andygrove
Copy link
Member Author

@sqlbenchmark run tpch --iterations 3

1 similar comment
@andygrove
Copy link
Member Author

@sqlbenchmark run tpch --iterations 3

Eliminate redundant CheckOverflow when wrapping WideDecimalBinaryExpr
(which already handles overflow). Fuse Cast(Decimal128→Decimal128) +
CheckOverflow into a single DecimalRescaleCheckOverflow expression that
rescales and validates precision in one pass.
@andygrove andygrove force-pushed the wide-decimal-binary-expr branch from 5a21500 to 91092a6 Compare March 2, 2026 18:46
@andygrove
Copy link
Member Author

@sqlbenchmark run tpch --iterations 3

@sqlbenchmark
Copy link

Benchmark job comet-pr-3619-c3986837690 failed due to an error.

@andygrove andygrove changed the title feat: fused WideDecimalBinaryExpr for Decimal128 add/sub/mul perf: Optimize some decimal expressions Mar 3, 2026
@andygrove andygrove marked this pull request as ready for review March 3, 2026 15:52
@coderfender
Copy link
Contributor

This is awesome !

let rescale_divisor = if need_rescale {
i256_pow10((max_scale - s_out) as u32)
} else {
i256::ONE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need to scale up to s_out if s_out > max_scale?

ColumnarValue::Scalar(ScalarValue::Decimal128(v, _precision, _scale)) => {
let new_v = v.and_then(|val| {
rescale_and_check(val, delta, scale_factor, bound, fail_on_error)
.ok()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If fail_on_error=true then this .ok() will convert it to None and hide the error


fn evaluate(&self, batch: &RecordBatch) -> datafusion::common::Result<ColumnarValue> {
let arg = self.child.evaluate(batch)?;
let delta = self.output_scale - self.input_scale;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scales could be negative.
Let's say output_scale=38 ( the maximum supported one) and input_scale=-1, the delta=39.
This will lead to an error below at 10i128.pow(abs_delta as u32);

/// Maximum absolute value for a given decimal precision: 10^p - 1.
#[inline]
fn precision_bound(precision: u8) -> i128 {
10i128.pow(precision as u32) - 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add some validation that precision is <=38 ?
or use checked_pow() and return a Result


/// Compute 10^exp as i256.
#[inline]
fn i256_pow10(exp: u32) -> i256 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extreme case: exp>77 will overflow

DataFusionOperator::Plus | DataFusionOperator::Minus | DataFusionOperator::Multiply,
Ok(DataType::Decimal128(p1, s1)),
Ok(DataType::Decimal128(p2, s2)),
Ok(DataType::Decimal128(_p1, _s1)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason to use _ prefix ? The variables are used

// divisor = 10^(-delta), half = divisor / 2
let divisor = scale_factor; // already 10^abs(delta)
let half = divisor / 2;
let sign = if value < 0 { -1i128 } else { 1i128 };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let sign = if value < 0 { -1i128 } else { 1i128 };
let sign = value.signum();

}

// Fuse Cast(Decimal128→Decimal128) + CheckOverflow into single rescale+check
if let Some(cast) = child.as_any().downcast_ref::<Cast>() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also check that the Cast's precision/scale match the output's p/s before fusing ?

fn with_new_children(
self: Arc<Self>,
children: Vec<Arc<dyn PhysicalExpr>>,
) -> datafusion::common::Result<Arc<dyn PhysicalExpr>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) -> datafusion::common::Result<Arc<dyn PhysicalExpr>> {
) -> datafusion::common::Result<Arc<dyn PhysicalExpr>> {
if children.len() != 1 {
return Err(DataFusionError::Internal(format!(
"DecimalRescaleCheckOverflow expects 1 child, got {}",
children.len()
)));
}

fn with_new_children(
self: Arc<Self>,
children: Vec<Arc<dyn PhysicalExpr>>,
) -> Result<Arc<dyn PhysicalExpr>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) -> Result<Arc<dyn PhysicalExpr>> {
) -> Result<Arc<dyn PhysicalExpr>> {
if children.len() != 2 {
return Err(DataFusionError::Internal(format!(
"WideDecimalBinaryExpr expects 2 children, got {}",
children.len()
)));
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants