Support sharding through config and raster_write_kwargs #1106
- Added additional settings
- Allow environment variables that overwrite config
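A rough sketch of what "environment variables that overwrite config" could look like, using only the standard library. The setting names raster_chunks and raster_shards come from this PR; everything else is an assumption made for illustration and is not the PR's actual implementation: the `DEMO_SETTINGS_` prefix, the JSON file format, and the `Settings` / `load_settings` names.

```python
from __future__ import annotations

import json
import os
from dataclasses import dataclass, fields
from pathlib import Path

# Hypothetical prefix, chosen for this sketch only.
ENV_PREFIX = "DEMO_SETTINGS_"


@dataclass(frozen=True)
class Settings:
    """Illustrative settings container; None means 'not configured'."""

    raster_chunks: tuple[int, ...] | None = None
    raster_shards: tuple[int, ...] | None = None


def _parse_shape(text: str) -> tuple[int, ...]:
    # "128, 128" -> (128, 128)
    return tuple(int(part) for part in text.split(",") if part.strip())


def load_settings(config_path: Path | None = None, env=None) -> Settings:
    """Read values from a JSON file, then let environment variables win."""
    env = os.environ if env is None else env
    values: dict[str, tuple[int, ...]] = {}

    # 1. Values stored in the config file (if the file exists).
    if config_path is not None and config_path.exists():
        stored = json.loads(config_path.read_text())
        for field in fields(Settings):
            if stored.get(field.name) is not None:
                values[field.name] = tuple(stored[field.name])

    # 2. Environment variables temporarily override the stored values.
    for field in fields(Settings):
        raw = env.get(ENV_PREFIX + field.name.upper())
        if raw:
            values[field.name] = _parse_shape(raw)

    return Settings(**values)
```

With this layout, a stored raster_chunks of [256, 256] would be replaced by (128, 128) for as long as `DEMO_SETTINGS_RASTER_CHUNKS=128,128` is set, and the file on disk stays untouched.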
Failing at the moment because the required ome-zarr version has not been released yet. You can test locally with ome-zarr-py from main. Also, we need to add support for zarrs to improve the speed of shard I/O.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1106 +/- ##
==========================================
+ Coverage 92.44% 92.50% +0.05%
==========================================
Files 51 51
Lines 7811 7857 +46
==========================================
+ Hits 7221 7268 +47
+ Misses 590 589 -1
Only these versions are supported because they use the zarr API correctly inside dask and also make it possible to set the tune optimization. The latter is required to prevent errors caused by dask partitions collapsing when data is read back in from parquet.
Should we also allow the control of sharding for anndata?
Yes, but not as part of this PR. I will adjust the config though to accommodate.
assert arr.shards == write_shards
other_arr = zarr.open_group(path / zarr_subpath / other_name, mode="r")["s0"]
assert other_arr.chunks == base_chunks
We could add an explicit check here that shards are None (?).
I don't think it is necessary to add here. The other writing tests basically don't specify it; this test is specifically for the case where it is specified. If it were somehow still specified, you would get permission errors straight away, as the chunks and shards would not match.
def test_write_raster_elements_sharding_chunking(tmp_path: Path) -> None:
What are we testing here that was not tested before? That write_element() supports sharding? In that case I'd rename the test.
Yeah, I did that independently because of downstream issues with non-matching shards / chunks. These errors are specific and not connected to our implementation per se, so I kept it separate. To me, the _sharding_chunking suffix indicates that, and it is not tested in the other functions.
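To make "non-matching shards / chunks" concrete: as far as I understand zarr v3 sharding, every shard dimension has to be an exact multiple of the corresponding chunk dimension. Below is a minimal pure-Python sketch of that rule. The helper names `shards_match_chunks` and `chunks_per_shard` are made up for illustration and are not part of zarr or of this PR.

```python
from __future__ import annotations

from collections.abc import Sequence


def shards_match_chunks(shards: Sequence[int], chunks: Sequence[int]) -> bool:
    """Return True if each shard dimension is a whole multiple of its chunk dimension."""
    if len(shards) != len(chunks):
        return False
    return all(c > 0 and s >= c and s % c == 0 for s, c in zip(shards, chunks))


def chunks_per_shard(shards: Sequence[int], chunks: Sequence[int]) -> int:
    """Number of chunks stored inside a single shard (hypothetical helper)."""
    if not shards_match_chunks(shards, chunks):
        raise ValueError(f"shards {tuple(shards)} do not match chunks {tuple(chunks)}")
    count = 1
    for s, c in zip(shards, chunks):
        count *= s // c
    return count
```

For example, shards of (1, 512, 512) with chunks of (1, 128, 128) pack 16 chunks into each shard, while shards of (500, 512) with chunks of (128, 128) are rejected because 500 is not a multiple of 128.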
Looks good in general. There are a few decisions to be made, but once they are made, most of the requested changes can be done in one shot by an agent.
chore: fix typo in docstrings
…to support_sharding
bump ome_zarr remove distributed add platformdirs
LucaMarconato left a comment
Changes look good so far.
We will opt for scverse-misc settings in a follow-up PR.
This PR adds the following:
- raster_write_kwargs for io functions like .write and .write_element. This also adds the ability to write sharded arrays. Support for anndata sharding is to be added in a follow-up PR.
- raster_write_kwargs argument.
- raster_chunks and raster_shards. The config can now be stored in a default location or a custom location. Additionally, environment variables can be set to temporarily override the values.
Additional changes
@LucaMarconato