Hi there! I've been processing 2D wave spectra grib files using kerchunk and have found that it doesn't correctly identify the coordinates, nor the dimensions of the data . The variable d2fd has the dimensions (directionNumber, frequencyNumber, latitude, longitude), so a 2D array per lat/lon point. Xarray and cfgrib read the data format correctly, but kerchunk does not. See the below,
cfgrib
xr.load_dataset("ec_spectra/T1P12180000010106001", engine="cfgrib")
<xarray.Dataset> Size: 3MB
Dimensions: (directionNumber: 36, frequencyNumber: 29, latitude: 21,
longitude: 31)
Coordinates:
number int64 8B 0
time datetime64[ns] 8B 2024-12-18
step timedelta64[ns] 8B 14 days 06:00:00
meanSea float64 8B 0.0
* directionNumber (directionNumber) int64 288B 1 2 3 4 5 6 ... 32 33 34 35 36
* frequencyNumber (frequencyNumber) int64 232B 1 2 3 4 5 6 ... 25 26 27 28 29
* latitude (latitude) float64 168B -38.0 -38.1 -38.2 ... -39.9 -40.0
* longitude (longitude) float64 248B 141.0 141.1 141.2 ... 143.9 144.0
valid_time datetime64[ns] 8B 2025-01-01T06:00:00
Data variables:
d2fd (directionNumber, frequencyNumber, latitude, longitude) float32 3MB ...
Attributes:
GRIB_edition: 1
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_subCentre: 0
Conventions: CF-1.7
institution: European Centre for Medium-Range Weather Forecasts
history: 2025-01-23T14:57 GRIB to CDM+CF via cfgrib-0.9.1...
kerchunk
single_ref = scan_grib("./ec_spectra/T1P12180000010100001")
print(f"Lenght of single_ref: {len(single_ref)}")
with open("single_ref.json", "w") as f:
json.dump(single_ref[0], f)
fs = fsspec.filesystem("reference", fo="single_ref.json")
xr.open_dataset(fs.get_mapper(""), engine="zarr")
Lenght of single_ref: 1
/home/gpickering/projects/welvops-Py/.venv/lib/python3.10/site-packages/xarray/backends/api.py:571: RuntimeWarning: Failed to open Zarr store with consolidated metadata, but successfully read with non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider:
1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
backend_ds = backend.open_dataset(
<xarray.Dataset> Size: 6kB
Dimensions: (latitude: 21, longitude: 31)
Coordinates:
* latitude (latitude) float64 168B -38.0 -38.1 -38.2 ... -39.8 -39.9 -40.0
* longitude (longitude) float64 248B 141.0 141.1 141.2 ... 143.8 143.9 144.0
meanSea float64 8B ...
number int64 8B ...
step timedelta64[ns] 8B ...
time datetime64[ns] 8B ...
valid_time datetime64[ns] 8B ...
Data variables:
d2fd (latitude, longitude) float64 5kB ...
Attributes:
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_edition: 1
GRIB_subCentre: 0
institution: European Centre for Medium-Range Weather Forecasts
I've been trying to understand the problem but unfortunately haven't reached a solution, here's a few things I've noticed though.
- cfgrib has a specific variable for spectra keys (directionNumber and frequencyNumber). Kerchunk uses cfgrib.dataset.COORD_ATTRS in a few places to loop over possible coordinates. This doesn't contain either directionNumber, nor frequencyNumber.
- _split_file doesn't break the grib file into its component messages (I think it's meant to?). The example file I've been testing has 1044 messages. I copied the ECMWF example on reading messages from a byte stream to create a new _split_file but that wasn't a complete fix.
$ grib_ls ./notebooks/ec_spectra/T1P12180000010100001 | tail -n 10
1 ecmf meanSea 0 20241218 336 fc 2dfd grid_simple regular_ll
1 ecmf meanSea 0 20241218 336 fc 2dfd grid_simple regular_ll
1 ecmf meanSea 0 20241218 336 fc 2dfd grid_simple regular_ll
1 ecmf meanSea 0 20241218 336 fc 2dfd grid_simple regular_ll
1 ecmf meanSea 0 20241218 336 fc 2dfd grid_simple regular_ll
1 ecmf meanSea 0 20241218 336 fc 2dfd grid_simple regular_ll
1 ecmf meanSea 0 20241218 336 fc 2dfd grid_simple regular_ll
1044 of 1044 messages in ./notebooks/ec_spectra/T1P12180000010100001
1044 of 1044 total messages in 1 files
def _split_file2(f: io.FileIO, skip=0):
data = f.read()
offset = 0
part = 0
while offset < len(data):
h = codes_new_from_message(data[offset:])
total_length = codes_get(h, "totalLength")
yield offset, total_length, data[offset : offset + total_length]
codes_release(h)
offset += total_length
part += 1
if skip and part >= skip:
break
Any suggestions would be greatly appreciated!
Hi there! I've been processing 2D wave spectra grib files using kerchunk and have found that it doesn't correctly identify the coordinates, nor the dimensions of the data . The variable d2fd has the dimensions (directionNumber, frequencyNumber, latitude, longitude), so a 2D array per lat/lon point. Xarray and cfgrib read the data format correctly, but kerchunk does not. See the below,
cfgrib
kerchunk
I've been trying to understand the problem but unfortunately haven't reached a solution, here's a few things I've noticed though.
Any suggestions would be greatly appreciated!