Skip to content

[SPARK-56873][CORE] Fix race condition in bounded k-way merge in UnsafeExternalSorter#55891

Draft
ivoson wants to merge 1 commit into
apache:masterfrom
ivoson:SPARK-56873
Draft

[SPARK-56873][CORE] Fix race condition in bounded k-way merge in UnsafeExternalSorter#55891
ivoson wants to merge 1 commit into
apache:masterfrom
ivoson:SPARK-56873

Conversation

@ivoson
Copy link
Copy Markdown
Contributor

@ivoson ivoson commented May 15, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Pre-rollout hardening of the bounded multi-round k-way merge added in
SPARK-56410. Feature remains gated on
`spark.unsafe.sorter.spill.merge.factor` (default -1, disabled), so no
production behavior change today.

Bug: getSortedIterator() took the merge-input snapshot inside
UnsafeSorterBoundedSpillMerger.merge() AFTER publishing readingIterator
(SpillableIterator). Once readingIterator is non-null, an external
consumer's spill() (e.g. a Photon native consumer requesting memory from
a non-task thread) is routed to readingIterator.spill(), which both
appends a new writer to the live spillWriters list AND rebinds
readingIterator.upstream to that new file. A race between publication
and snapshot causes the new file to be fed into the final merge twice:
once via the snapshot, once via readingIterator. Result: silent
duplicate records in sorted output.

Fix: lift snapshot + boundedMerger construction + readingIterator
publication into a named seam, prepareBoundedMerge(), that takes the
defensive snapshot strictly BEFORE publishing readingIterator. The
synchronized(this) around boundedMerger assignment matches the existing
read under synchronized(this) in cleanupResources(); the volatile
modifier already covered visibility, the lock is a consistency win for
future readers.

Adds a deterministic regression test,
testBoundedMergeSnapshotIsolatedFromConcurrentSpill, that drives the
race scenario by direct sequencing via the new seam: prepare ->
external-trigger spill -> merge, asserting (a) the snapshot is
isolated from live spillWriters mutation and (b) every record appears
exactly once in the merged output. Pre-fix code fails this test
deterministically.

Co-authored-by: Isaac
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant