[SPARK-55792][PS] Optimize DataFrame diff axis=0 by emanhthangngot · Pull Request #55899 · apache/spark

emanhthangngot · 2026-05-15T11:02:56Z

What changes were proposed in this pull request?

This PR optimizes pandas-on-Spark DataFrame.diff(axis=0) and Series.diff() to avoid using an unpartitioned Spark Window.

The new implementation range-partitions by the natural order column, computes pandas diff() within each Spark partition, and exchanges only the boundary rows needed to preserve correctness across partition boundaries. It also keeps the existing grouped diff() path unchanged.
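To make the range-partitioning step concrete, here is a minimal, Spark-free sketch of the idea. It is not the code in this PR: `range_partition` and its evenly spaced bounds are hypothetical stand-ins for illustration, and Spark estimates its range bounds by sampling rather than sorting every key. The property that matters is that each partition holds one contiguous slice of the natural order, so rows only ever depend on neighbours in adjacent partitions.

```python
from bisect import bisect_right
from typing import Any, List, Tuple

Row = Tuple[int, Any]  # (natural-order key, value); keys are assumed unique


def range_partition(rows: List[Row], num_partitions: int) -> List[List[Row]]:
    """Split rows into contiguous, non-overlapping key ranges (hypothetical helper)."""
    keys = sorted(key for key, _ in rows)
    if not keys or num_partitions <= 1:
        return [sorted(rows, key=lambda r: r[0])]
    # Pick num_partitions - 1 split points at evenly spaced positions in the
    # sorted keys. This is an assumption of the sketch, not Spark's algorithm.
    bounds = [keys[(i * len(keys)) // num_partitions] for i in range(1, num_partitions)]
    parts: List[List[Row]] = [[] for _ in range(num_partitions)]
    for key, value in rows:
        # bisect_right sends a key equal to a bound into the higher partition.
        parts[bisect_right(bounds, key)].append((key, value))
    # Sort inside each partition so a local positional diff sees rows in order.
    return [sorted(p, key=lambda r: r[0]) for p in parts]


rows = [(3, "d"), (0, "a"), (5, "f"), (1, "b"), (4, "e"), (2, "c")]
for part in range_partition(rows, 3):
    print(part)
```

Concatenating the partitions in order reproduces the globally sorted rows, which is what lets a per-partition diff be stitched back together.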

Additional tests cover:

- absence of a Window in the analyzed plan for DataFrame.diff()
- empty DataFrames
- MultiIndex rows
- null values
- single-partition execution
- zero, negative, and large periods
- cross-partition boundary rows
- Series.diff() delegation

Why are the changes needed?

DataFrame.diff(axis=0) currently delegates to Series._diff() without a partition specification. This creates a Spark Window over the whole DataFrame ordered by the natural order column, which can force all data into a single partition and cause scaling issues for large datasets.

This change removes that unpartitioned Window from the DataFrame.diff(axis=0) / Series.diff() path while preserving pandas-compatible positional diff semantics, including rows at partition boundaries.
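The boundary handling can be pictured with a small pure-Python sketch. It is an illustration under stated assumptions, not the implementation in this PR: partitions are plain lists already in natural order, `None` stands in for a null, and `reference_diff`, `boundary_rows`, and `partitioned_diff` are hypothetical names. Each partition borrows at most `abs(periods)` rows from its neighbours, runs a local diff, and then drops the borrowed rows.

```python
from typing import List, Optional

Value = Optional[float]


def _sub(a: Value, b: Value) -> Value:
    # A null on either side gives a null result, as in pandas diff().
    return None if a is None or b is None else a - b


def reference_diff(values: List[Value], periods: int = 1) -> List[Value]:
    """Positional diff over one ordered list: out[i] = x[i] - x[i - periods]."""
    n = len(values)
    return [
        _sub(values[i], values[i - periods]) if 0 <= i - periods < n else None
        for i in range(n)
    ]


def boundary_rows(partitions: List[List[Value]], idx: int, periods: int) -> List[Value]:
    """Rows partition `idx` needs from its neighbours (hypothetical helper)."""
    k = abs(periods)
    if k == 0:
        return []
    if periods > 0:
        # Only the last k rows of each earlier partition can ever be needed.
        tails = [v for p in partitions[:idx] for v in p[-k:]]
        return tails[-k:]
    # Negative periods look ahead: the first k rows of later partitions.
    heads = [v for p in partitions[idx + 1:] for v in p[:k]]
    return heads[:k]


def partitioned_diff(partitions: List[List[Value]], periods: int = 1) -> List[List[Value]]:
    out = []
    for idx, part in enumerate(partitions):
        context = boundary_rows(partitions, idx, periods)
        if periods >= 0:
            local = reference_diff(context + part, periods)
            out.append(local[len(context):])  # drop the borrowed leading rows
        else:
            local = reference_diff(part + context, periods)
            out.append(local[:len(part)])  # drop the borrowed trailing rows
    return out


partitions = [[1.0, 4.0], [9.0], [], [16.0, None, 36.0]]
print(partitioned_diff(partitions, periods=1))
```

Flattening the result of `partitioned_diff` gives the same values as `reference_diff` over the concatenated data, including when a partition is empty or shorter than `abs(periods)`.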

Does this PR introduce any user-facing change?

Yes. DataFrame.diff(axis=0) and Series.diff() now avoid the previous unpartitioned Window execution path. The intended result values are unchanged.

How was this patch tested?

Ran:

python/run-tests --python-executables .venv/bin/python --testnames pyspark.pandas.tests.computation.test_compute

The test was run from a temporary path without spaces because the local checkout path contains spaces and Spark's Java launcher fails to start from that path.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex (GPT-5)

Codex was used to help inspect the existing implementation, identify the unpartitioned Window path, refine the patch, and prepare tests. The final changes were reviewed and validated by the author.

… dtype
- Apply ruff format corrections to frame.py (dict comprehension layout, slice spacing)
- Remove rowsBetween from lag window in Series._diff for Spark Connect compatibility
- Update test_groupby_diff expectations to float dtype (remove .astype(int) cast)

emanhthangngot force-pushed the SPARK-55792 branch 3 times, most recently from 65c6499 to a33e14a Compare May 15, 2026 17:00

emanhthangngot added 2 commits May 16, 2026 13:38

[SPARK-55792][PS] Optimize DataFrame diff axis=0

c07e4a3

emanhthangngot force-pushed the SPARK-55792 branch from a33e14a to b5981e6 Compare May 16, 2026 06:41

emanhthangngot wants to merge 2 commits into apache:master from emanhthangngot:SPARK-55792
