I was having issues generating a dataset from the full DANRA pressure levels data. Everything works nicely with the small test dataset in the example config (https://mllam-test-data.s3.eu-north-1.amazonaws.com/height_levels.zarr), but I was getting a lot of NaNs in my output when using the full dataset. The reason for this is apparently this part here:

    # Initialize the output dataset and add dimensions
    ds = xr.Dataset()
    ds.attrs.update(ds_input.attrs)
    for dim in ds_input.dims:
        ds = ds.assign_coords({dim: ds_input.coords[dim]})

More specifically, the problem is the part where the dataset to be saved to zarr (ds) is assigned coordinates (lines 160-161). For the full DANRA pressure levels dataset, this means that the following coordinates are assigned to ds:

    Coordinates:
      * time      (time) datetime64[ns] 774kB 1990-09-01 ... 2023-10-13T21:00:00
      * pressure  (pressure) int64 112B 1000 950 925 900 850 ... 400 300 250 200 100
      * y         (y) float64 5kB -6.095e+05 -6.07e+05 ... 8.58e+05 8.605e+05
      * x         (x) float64 6kB -1.999e+06 -1.997e+06 ... -3.175e+04 -2.925e+04

If we then want to select e.g. the 1000 hPa z variable, this is a data-array, da, with a "pressure" dimension of length 1:

    <xarray.DataArray 'z' (time: 96768, pressure: 1, y: 589, x: 789)> Size: 360GB
    dask.array<getitem, shape=(96768, 1, 589, 789), dtype=float64, chunksize=(256, 1, 295, 263), chunktype=numpy.ndarray>
    Coordinates:
        lat       (y, x) float64 4MB dask.array<chunksize=(295, 263), meta=np.ndarray>
        lon       (y, x) float64 4MB dask.array<chunksize=(295, 263), meta=np.ndarray>
      * pressure  (pressure) int64 8B 1000
      * time      (time) datetime64[ns] 774kB 1990-09-01 ... 2023-10-13T21:00:00
      * x         (x) float64 6kB -1.999e+06 -1.997e+06 ... -3.175e+04 -2.925e+04
      * y         (y) float64 5kB -6.095e+05 -6.07e+05 ... 8.58e+05 8.605e+05

When we then try to add this da to our dataset ds, it looks like there is a mismatch between the "pressure" dimensions (length 1 in da versus all pressure levels in ds), and we get a lot of NaNs (I am fairly new to xarray, so I am not really sure how it works).
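A minimal sketch of what seems to be going on, using toy data rather than DANRA. The names `ds_with_coords` and `ds_without_coords` and the tiny coordinate values are made up for illustration, and assigning with `ds["z"] = da` is an assumption about how the data-array ends up in the dataset, not necessarily the exact code path in `create_dataset.py`. As far as I understand, xarray aligns an assigned data-array to the indexes the dataset already has, so labels missing from `da` are filled with NaN:

```python
import numpy as np
import xarray as xr

# Toy stand-in for the input dataset: three pressure levels, two x points.
ds_input = xr.Dataset(coords={"pressure": [1000, 950, 925], "x": [0.0, 1.0]})

# Toy stand-in for the selected variable: only the 1000 hPa level.
da = xr.DataArray(
    np.ones((1, 2)),
    dims=("pressure", "x"),
    coords={"pressure": [1000], "x": [0.0, 1.0]},
    name="z",
)

# Case 1: output dataset pre-populated with all input coordinates,
# mirroring the loop in the snippet above.
ds_with_coords = xr.Dataset()
for dim in ds_input.dims:
    ds_with_coords = ds_with_coords.assign_coords({dim: ds_input.coords[dim]})
ds_with_coords["z"] = da  # da is aligned to the existing pressure index

# Case 2: empty output dataset, the data-array brings its own coordinates.
ds_without_coords = xr.Dataset()
ds_without_coords["z"] = da

n_nan_with = int(ds_with_coords["z"].isnull().sum())
n_nan_without = int(ds_without_coords["z"].isnull().sum())
print("pressure size with coords:", ds_with_coords["z"].sizes["pressure"])
print("NaNs with coords:", n_nan_with)
print("NaNs without coords:", n_nan_without)
```

In this toy case the first dataset keeps all three pressure levels and gets NaNs on the two levels that `da` does not contain, while the second one only has the single level and no NaNs.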
However, before #34 we didn't add any coordinates to the dataset, since we just added the data-arrays directly. I know that at one point when developing the derived variables feature I added this, since otherwise I couldn't get it to work with variables that only had the time coordinate (e.g. the hour of day). But that doesn't seem to be necessary any more.
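As a small sanity check of that last point, here is a sketch with toy data (the variable name `hour_of_day` and the four time stamps are made up, and this is plain xarray, not the derived variables code itself). A variable that only depends on time can be added to an empty dataset, and it brings its time coordinate along:

```python
import pandas as pd
import xarray as xr

# Toy time axis with 3-hourly steps.
time = pd.to_datetime(
    [
        "1990-09-01T00:00",
        "1990-09-01T03:00",
        "1990-09-01T06:00",
        "1990-09-01T09:00",
    ]
)
ds_input = xr.Dataset(coords={"time": time})

# Hypothetical derived variable that only has the time dimension.
da_hour = ds_input["time"].dt.hour.rename("hour_of_day")

# No coordinates are assigned up front; the data-array carries its own.
ds = xr.Dataset()
ds["hour_of_day"] = da_hour

print(ds)
```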
I have run some tests, and it looks like it should be enough to just remove these lines:

    for dim in ds_input.dims:
        ds = ds.assign_coords({dim: ds_input.coords[dim]})

But I would like some input from someone with more experience with xarray before I make a bugfix for this, @leifdenby, @observingClouds.
Referenced code: mllam-data-prep/mllam_data_prep/create_dataset.py, lines 157 to 161 in 3a48c99 (the lines proposed for removal are 160 to 161).