scan_grib unable to process spectra data

Hi there! I've been processing 2D wave spectra grib files using kerchunk and have found that it doesn't correctly identify the coordinates, nor the dimensions of the data . The variable _d2fd_ has the dimensions (directionNumber, frequencyNumber, latitude, longitude), so a 2D array per lat/lon point. Xarray and cfgrib read the data format correctly, but kerchunk does not. See the below,

**cfgrib**
```python
xr.load_dataset("ec_spectra/T1P12180000010106001", engine="cfgrib")

<xarray.Dataset> Size: 3MB
Dimensions:          (directionNumber: 36, frequencyNumber: 29, latitude: 21,
                      longitude: 31)
Coordinates:
    number           int64 8B 0
    time             datetime64[ns] 8B 2024-12-18
    step             timedelta64[ns] 8B 14 days 06:00:00
    meanSea          float64 8B 0.0
  * directionNumber  (directionNumber) int64 288B 1 2 3 4 5 6 ... 32 33 34 35 36
  * frequencyNumber  (frequencyNumber) int64 232B 1 2 3 4 5 6 ... 25 26 27 28 29
  * latitude         (latitude) float64 168B -38.0 -38.1 -38.2 ... -39.9 -40.0
  * longitude        (longitude) float64 248B 141.0 141.1 141.2 ... 143.9 144.0
    valid_time       datetime64[ns] 8B 2025-01-01T06:00:00
Data variables:
    d2fd             (directionNumber, frequencyNumber, latitude, longitude) float32 3MB ...
Attributes:
    GRIB_edition:            1
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2025-01-23T14:57 GRIB to CDM+CF via cfgrib-0.9.1...
```

**kerchunk**
```python
single_ref = scan_grib("./ec_spectra/T1P12180000010100001")
print(f"Lenght of single_ref: {len(single_ref)}")
with open("single_ref.json", "w") as f:
    json.dump(single_ref[0], f)

fs = fsspec.filesystem("reference", fo="single_ref.json")
xr.open_dataset(fs.get_mapper(""), engine="zarr")

Lenght of single_ref: 1
/home/gpickering/projects/welvops-Py/.venv/lib/python3.10/site-packages/xarray/backends/api.py:571: RuntimeWarning: Failed to open Zarr store with consolidated metadata, but successfully read with non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider:
1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
  backend_ds = backend.open_dataset(
<xarray.Dataset> Size: 6kB
Dimensions:     (latitude: 21, longitude: 31)
Coordinates:
  * latitude    (latitude) float64 168B -38.0 -38.1 -38.2 ... -39.8 -39.9 -40.0
  * longitude   (longitude) float64 248B 141.0 141.1 141.2 ... 143.8 143.9 144.0
    meanSea     float64 8B ...
    number      int64 8B ...
    step        timedelta64[ns] 8B ...
    time        datetime64[ns] 8B ...
    valid_time  datetime64[ns] 8B ...
Data variables:
    d2fd        (latitude, longitude) float64 5kB ...
Attributes:
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_edition:            1
    GRIB_subCentre:          0
    institution:             European Centre for Medium-Range Weather Forecasts
```

I've been trying to understand the problem but unfortunately haven't reached a solution, here's a few things I've noticed though.

- [cfgrib has a specific variable for spectra keys (directionNumber and frequencyNumber)](https://github.com/ecmwf/cfgrib/blob/e463c5f25d701e2eb377dabfd9522ddd5d8f1836/cfgrib/dataset.py#L166). Kerchunk uses cfgrib.dataset.COORD_ATTRS in a few places to loop over possible coordinates. This doesn't contain either directionNumber, nor frequencyNumber.
- _split_file doesn't break the grib file into its component messages (I think it's meant to?). The example file I've been testing has 1044 messages. I copied the [ECMWF example](https://confluence.ecmwf.int/display/UDOC/How+do+I+decode+messages+from+a+byte+stream+-+ecCodes+FAQ) on reading messages from a byte stream to create a new _split_file but that wasn't a complete fix.
```bash
$ grib_ls ./notebooks/ec_spectra/T1P12180000010100001 | tail -n 10
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1044 of 1044 messages in ./notebooks/ec_spectra/T1P12180000010100001

1044 of 1044 total messages in 1 files
```
```python
def _split_file2(f: io.FileIO, skip=0):
    data = f.read()
    offset = 0
    part = 0

    while offset < len(data):
        h = codes_new_from_message(data[offset:])
        total_length = codes_get(h, "totalLength")
        yield offset, total_length, data[offset : offset + total_length]
        codes_release(h)
        offset += total_length
        part += 1
        if skip and part >= skip:
            break
```

Any suggestions would be greatly appreciated! 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scan_grib unable to process spectra data #540

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

scan_grib unable to process spectra data #540

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions