
Issue when adding coordinates to the output dataset #58

Description

@ealerskans

I was having issues generating a dataset from the full DANRA pressure levels data. Everything works nicely with the small test dataset in the example config (https://mllam-test-data.s3.eu-north-1.amazonaws.com/height_levels.zarr), but I was getting a lot of NaNs in my output when using the full dataset. The cause appears to be this part:

# Initialize the output dataset and add dimensions
ds = xr.Dataset()
ds.attrs.update(ds_input.attrs)
for dim in ds_input.dims:
    ds = ds.assign_coords({dim: ds_input.coords[dim]})

More specifically, it is the part where the dataset to be saved to zarr (ds) is assigned coordinates (lines 160-161). For the full DANRA pressure levels dataset, this means that we assign the following coordinates to ds:

Coordinates:
  * time      (time) datetime64[ns] 774kB 1990-09-01 ... 2023-10-13T21:00:00
  * pressure  (pressure) int64 112B 1000 950 925 900 850 ... 400 300 250 200 100
  * y         (y) float64 5kB -6.095e+05 -6.07e+05 ... 8.58e+05 8.605e+05
  * x         (x) float64 6kB -1.999e+06 -1.997e+06 ... -3.175e+04 -2.925e+04

If we then want to select, e.g., the z variable at 1000 hPa, we get a data-array, da, whose pressure dimension has length 1:

<xarray.DataArray 'z' (time: 96768, pressure: 1, y: 589, x: 789)> Size: 360GB
dask.array<getitem, shape=(96768, 1, 589, 789), dtype=float64, chunksize=(256, 1, 295, 263), chunktype=numpy.ndarray>
Coordinates:
    lat       (y, x) float64 4MB dask.array<chunksize=(295, 263), meta=np.ndarray>
    lon       (y, x) float64 4MB dask.array<chunksize=(295, 263), meta=np.ndarray>
  * pressure  (pressure) int64 8B 1000
  * time      (time) datetime64[ns] 774kB 1990-09-01 ... 2023-10-13T21:00:00
  * x         (x) float64 6kB -1.999e+06 -1.997e+06 ... -3.175e+04 -2.925e+04
  * y         (y) float64 5kB -6.095e+05 -6.07e+05 ... 8.58e+05 8.605e+05

When we then try to add this da to our dataset ds, there seems to be a mismatch between the pressure coordinates of the two, and we get a lot of NaNs (I am fairly new to xarray, so I am not really sure how it works).
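The mismatch can be reproduced on a toy dataset. The sketch below is only an illustration under assumptions: the three pressure levels, the two time steps and the values are made up, and only the initialization loop is taken from the snippet above. As far as I understand it, assigning a data-array to a dataset that already has a pressure coordinate makes xarray align the data-array to that existing coordinate, and the levels that are missing from da are filled with NaN.

```python
import numpy as np
import xarray as xr

# Toy stand-in for the input dataset (made-up values, 3 pressure levels)
ds_input = xr.Dataset(
    {"z": (("time", "pressure"), np.arange(6.0).reshape(2, 3))},
    coords={"time": [0, 1], "pressure": [1000, 950, 925]},
)

# Same initialization as in the snippet above: copy all coordinates to ds
ds = xr.Dataset()
for dim in ds_input.dims:
    ds = ds.assign_coords({dim: ds_input.coords[dim]})

# Select a single level but keep the pressure dimension (length 1)
da = ds_input["z"].sel(pressure=[1000])

# On assignment, da is aligned to the pressure coordinate already on ds
ds["z"] = da

n_levels = ds["z"].sizes["pressure"]   # all levels of ds, not just 1000
n_nan = int(ds["z"].isnull().sum())    # entries for the levels da lacks
print(n_levels, n_nan)
```

Here da only holds the 1000 hPa level, so the entries for 950 and 925 hPa in ds["z"] come out as NaN at every time step.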

However, before #34 we didn't add any coordinates to the dataset, since we just added the data-arrays directly. At one point while developing the derived variables feature I added this, because otherwise I couldn't get it to work with variables that only had the time coordinate (e.g. the hour of day). But that doesn't seem to be necessary any more.

I have run some tests, and it looks like it should be enough to just remove these lines:

for dim in ds_input.dims:
    ds = ds.assign_coords({dim: ds_input.coords[dim]})

But I would like some input from someone with more experience with xarray before I make a bugfix for this, @leifdenby, @observingClouds.
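For reference, a minimal sketch of the proposed change on a toy dataset. It is an illustration under assumptions, not the actual code path: the data, the two time steps and the name hour_of_day are made up, and the time-only variable is just a stand-in for a derived variable. Without the coordinate loop, each data-array brings its own coordinates into the empty dataset, so the single-level selection keeps a pressure dimension of length 1 and a variable with only the time coordinate can still be added.

```python
import numpy as np
import xarray as xr

# Toy stand-in for the input dataset (made-up values, 3 pressure levels)
times = np.array(["1990-09-01T00", "1990-09-01T03"], dtype="datetime64[ns]")
ds_input = xr.Dataset(
    {"z": (("time", "pressure"), np.arange(6.0).reshape(2, 3))},
    coords={"time": times, "pressure": [1000, 950, 925]},
)

# Initialize the output dataset WITHOUT assigning coordinates up front
ds_out = xr.Dataset()
ds_out.attrs.update(ds_input.attrs)

# The selected data-array carries its own coordinates into ds_out
da = ds_input["z"].sel(pressure=[1000])
ds_out["z"] = da

# Hypothetical time-only variable, standing in for a derived variable
ds_out["hour_of_day"] = ds_input["time"].dt.hour

print(ds_out["z"].sizes["pressure"])
print(int(ds_out["z"].isnull().sum()))
print(ds_out["hour_of_day"].dims)
```

This only shows that the toy case works without the loop; whether the same holds for all variables in the real pipeline is exactly what I would like confirmed.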
