[SPARK-55792][PS] Optimize DataFrame diff axis=0#55899
Open
emanhthangngot wants to merge 2 commits into
Open
Conversation
65c6499 to
a33e14a
Compare
… dtype - Apply ruff format corrections to frame.py (dict comprehension layout, slice spacing) - Remove rowsBetween from lag window in Series._diff for Spark Connect compatibility - Update test_groupby_diff expectations to float dtype (remove .astype(int) cast)
a33e14a to
b5981e6
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
This PR optimizes pandas-on-Spark
DataFrame.diff(axis=0)andSeries.diff()to avoid using an unpartitioned Spark Window.The new implementation range-partitions by the natural order column, computes pandas
diff()within each Spark partition, and exchanges only the boundary rows needed to preserve correctness across partition boundaries. It also keeps the existing groupeddiff()path unchanged.Additional tests cover:
DataFrame.diff()Series.diff()delegationWhy are the changes needed?
DataFrame.diff(axis=0)currently delegates toSeries._diff()without a partition specification. This creates a Spark Window over the whole DataFrame ordered by the natural order column, which can force all data into a single partition and cause scaling issues for large datasets.This change removes that unpartitioned Window from the
DataFrame.diff(axis=0)/Series.diff()path while preserving pandas-compatible positional diff semantics, including rows at partition boundaries.Does this PR introduce any user-facing change?
Yes.
DataFrame.diff(axis=0)andSeries.diff()now avoid the previous unpartitioned Window execution path. The intended result values are unchanged.How was this patch tested?
Ran:
The test was run from a temporary path without spaces because the local checkout path contains spaces and Spark's Java launcher fails to start from that path.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Codex (GPT-5)
Codex was used to help inspect the existing implementation, identify the unpartitioned Window path, refine the patch, and prepare tests. The final changes were reviewed and validated by the author.