[WIP][CONNECT] Support directory uploads in Spark Connect copyFromLocalToFs#55904

Open
RaghunandanKumar wants to merge 1 commit into apache:master from RaghunandanKumar:codex/connect-copy-dir

Conversation

@RaghunandanKumar

What changes were proposed in this pull request?

This change teaches Spark Connect SparkSession.copyFromLocalToFs to accept a local directory path in addition to a single file path.

Changes in this PR:

  • update the Spark Connect artifact manager to expand a local directory into per-file forward_to_fs artifacts while preserving the nested relative layout
  • keep the existing file upload path unchanged
  • update the PySpark ML Connect helper to use the new recursive directory-copy behavior in remote mode
  • remove the old one-level directory limitation from the local ML helper as well
  • add a regression test that copies a directory containing a nested file tree and verifies both files arrive at the destination
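The directory-expansion step described above can be sketched in plain Python. This is a minimal illustration, not the actual artifact-manager code: the helper name `expand_dir_to_artifacts` is hypothetical, and it only shows how a local directory maps to per-file destination paths that preserve the nested relative layout.

```python
import os

def expand_dir_to_artifacts(local_dir, dest_path):
    """Hypothetical sketch: expand a local directory into
    (source_file, destination_path) pairs, one per file,
    preserving the nested relative layout under dest_path."""
    pairs = []
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            src = os.path.join(root, name)
            # Path of the file relative to the uploaded directory root.
            rel = os.path.relpath(src, local_dir)
            pairs.append((src, os.path.join(dest_path, rel)))
    return pairs
```

In the real change, each such pair would become a separate forward_to_fs artifact, which is what lets the existing single-file upload path stay unchanged.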

Why are the changes needed?

Today, the Connect artifact path accepts only a single file for copyFromLocalToFs, even though PySpark ML Connect ships a directory-copy helper and model save flows naturally need to stage directory trees.

This leaves two rough edges:

  • the Connect path cannot directly copy a directory tree
  • the ML helper has its own one-level directory workaround instead of reusing a stronger Connect primitive

Supporting recursive directory uploads in the Connect artifact path makes the API more generally useful and removes the need for the shallow workaround in ML Connect.

Does this PR introduce any user-facing change?

Yes.

Before this change, SparkSession.copyFromLocalToFs(local_dir, dest_path) in Spark Connect only supported a single local file path and would not handle a directory tree.

After this change, the same API accepts a local directory and copies all files under it recursively while preserving relative paths under the destination.

How was this patch tested?

Added a focused regression test in pyspark.sql.tests.connect.client.test_artifact covering nested directory copy.
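The shape of that regression test can be sketched with a pure-Python stand-in for the copy. `copy_tree_preserving_layout` below is an illustrative helper, not the Connect code path under test; it only demonstrates the invariant the test checks, namely that every file under the source directory arrives at the same relative path under the destination.

```python
import os
import shutil

def copy_tree_preserving_layout(src_dir, dest_dir):
    # Illustrative stand-in for the recursive directory copy:
    # every file under src_dir lands at the same relative path
    # under dest_dir, creating intermediate directories as needed.
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            src = os.path.join(root, name)
            rel = os.path.relpath(src, src_dir)
            target = os.path.join(dest_dir, rel)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.copy2(src, target)
```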

Attempted local verification with:

  • python/run-tests --testnames pyspark.sql.tests.connect.client.test_artifact
  • build/sbt -Phive package

In this environment, local verification is currently blocked because Spark is not built and the machine does not have a usable Java runtime configured for build/sbt.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex GPT-5

@RaghunandanKumar RaghunandanKumar marked this pull request as ready for review May 15, 2026 16:32