
Sgrasse/develop/1526 update io#1908

Open
grassesi wants to merge 40 commits into ecmwf:develop from
grassesi:sgrasse/develop/1526-update-io

Conversation

Contributor

@grassesi grassesi commented Feb 23, 2026

Description

Issue Number

Closes #1526

Is this PR a draft? Mark it as draft.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@grassesi grassesi force-pushed the sgrasse/develop/1526-update-io branch from 8ed6d54 to b7c57c1 on March 3, 2026 14:55
@grassesi grassesi marked this pull request as ready for review March 3, 2026 14:59
@grassesi grassesi requested a review from clessig March 5, 2026 06:43
@github-actions github-actions bot added the infra (Issues related to infrastructure) and model (Related to model training or definition, not generic infra) labels Mar 9, 2026
@github-project-automation github-project-automation bot moved this to In Progress in WeatherGen-dev Mar 11, 2026
Contributor

@florianscheidl florianscheidl left a comment

Many nice changes! To assess the functional improvements, I'd need more time to understand the changes in depth.

Some higher-level comments:

  1. For typing, it might be nice to add some static type checking like pyright or ty (Astral)
  2. Would be helpful to have more docstrings/explainers for the new classes and functions introduced in Writer
  3. What do we use for logging? I've seen a couple of raises with specific errors, but no logging.

I'll test it on our HPC once we've implemented the fixes.
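As an illustration of point 3, a module-level logger is one common pattern. This sketch is hypothetical (`load_streams` and the `"streams"` key are made-up names); the project may already have its own logging setup:

```python
import logging

logger = logging.getLogger(__name__)


def load_streams(cfg: dict) -> list:
    """Return the configured output streams, warning instead of raising."""
    streams = cfg.get("streams")
    if streams is None:
        # log the recoverable condition and fall back to a no-op
        logger.warning("no output streams configured; writer is a no-op")
        return []
    return streams
```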

Collaborator

@clessig clessig left a comment

See comments. I will also rebase.

self._append_dataset(self.source, "source")

if self.key.with_target(forecast_offset):
# TODO requiring target data currently prevents predictions with unknowable target
Collaborator

Does the code work when target is None? Or should we have a tensor of size (0,num_channels)?

Contributor Author

No, currently neither option works. But it should work with target=None. Writing an entire extra dataset to zarr that is totally empty is a bad band-aid solution. I think we should rename ItemKey.with_target to ItemKey.with_prediction (since it is the prediction we are always interested in; if the target is present it just means some ground truth is available, and if not that is also fine) and require only that the prediction is not None. Will implement.
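A minimal sketch of the behaviour proposed above: require a prediction, treat the target as optional ground truth. The class and method names here are illustrative stand-ins, not the actual Writer API.

```python
class WriterSketch:
    """Illustrative writer: a prediction is mandatory, a target is optional."""

    def __init__(self):
        self.datasets: dict[str, list] = {}

    def _append_dataset(self, data, name: str) -> None:
        self.datasets.setdefault(name, []).append(data)

    def append_item(self, prediction, target=None) -> None:
        if prediction is None:
            # the prediction is what we are always interested in
            raise ValueError("a prediction is always required")
        self._append_dataset(prediction, "prediction")
        if target is not None:
            # ground truth is available, so write it alongside the prediction
            self._append_dataset(target, "target")
```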


# device of the tensors in the batch
device: str | torch.device
The data an instance contains is associated with one particular initial sampling window.
Collaborator

This should replace the whole paragraph:

Class representing one batch processed by the model.

This is not the place to explain how some of the general processing works.

Contributor Author

At least I was missing this context when processing the data, where should I put the information instead?

Collaborator

MultiStreamDataSampler is responsible for assembling batches

targets = [
targets
for loss_name, targets in targets_and_auxs.items()
if loss_name == PHYSICAL_LOSS_KEY
Collaborator

What about latent outputs?

Contributor Author

If possible I would like to keep this out of scope for now and wait for #1860 to be finalized.

try: # TODO: do this in config
_output_streams = val_cfg.output.streams
if _output_streams is None:
raise ConfigAttributeError("")
Collaborator

This should generate a warning and then terminate regularly

Contributor Author

Currently not every training/validation/test config implements an "output" section; this guarantees a fallback/default. If the output section is missing, should the default instead be no output / a no-op writer? Failing instead would mean that configs where no output is needed would be required to specify an "output" section.

Collaborator

If there's no output section, or no streams are specified, or the specified streams are not present, all of these cases should generate a warning and terminate regularly.
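The fallback behaviour suggested here could look like the following sketch, assuming plain-dict-style config access; the function name and config keys are hypothetical:

```python
import logging

logger = logging.getLogger(__name__)


def select_output_streams(val_cfg: dict, streams: dict) -> dict:
    """Warn and fall back to no output instead of raising."""
    requested = val_cfg.get("output", {}).get("streams")
    if not requested:
        # missing output section or empty stream list: terminate regularly
        logger.warning("no output streams configured; nothing will be written")
        return {}
    missing = [name for name in requested if name not in streams]
    if missing:
        logger.warning("requested output streams not present: %s", missing)
    return {name: cfg for name, cfg in streams.items() if name in requested}
```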

name: config for name, config in streams.items() if name in _output_streams
}
try: # TODO: do this in config
self._forecast_offset = val_cfg.forecast.offset
Collaborator

Why not use self._forecast_offset = val_cfg.forecast.get("offset", 0)? That's much more readable than the try/except.

Contributor Author

The try/except also handles the case where a training/val/test config does not contain a "forecast" section. If I can always assume it is there, I am happy to change it.

Collaborator

val_cfg.get("forecast", {}).get("offset", 0) is better
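The suggested accessor, shown here on a plain dict; with an OmegaConf DictConfig the same .get chain should also work, since DictConfig.get accepts a default like dict.get. This sketch is illustrative only:

```python
def forecast_offset(val_cfg: dict) -> int:
    """Read forecast.offset, defaulting to 0 when either level is missing."""
    return val_cfg.get("forecast", {}).get("offset", 0)
```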

data = self._predictions.get_physical_prediction_normalized(key, self._normalizer)
except Exception as e:
# TODO: if preds are empty, create a copy of target and add an ensemble dimension
# preds = [targets[0].clone().unsqueeze(0)]
Collaborator

We need this. What's the problem?
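The commented-out fallback would, in torch terms, clone the first target and prepend an ensemble dimension of size one; a numpy analogue of that shape change, for illustration only:

```python
import numpy as np


def fallback_prediction(targets: list[np.ndarray]) -> np.ndarray:
    """Numpy analogue of `targets[0].clone().unsqueeze(0)`: copy the first
    target and prepend an ensemble axis of size 1."""
    return np.expand_dims(targets[0].copy(), axis=0)
```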

return self

@property
def sample_idxs(self) -> list[int]:
Collaborator

Why is this needed?

Contributor Author

@grassesi grassesi Mar 16, 2026

Just quality of life: use batch.target_samples.sample_idxs instead of sample_idxs = [sample.sample_idx for sample in batch.target_samples.samples]
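A sketch of the convenience property under discussion; Sample and SampleCollection are stand-ins for the actual batch classes:

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    sample_idx: int


@dataclass
class SampleCollection:
    samples: list[Sample] = field(default_factory=list)

    @property
    def sample_idxs(self) -> list[int]:
        # convenience accessor replacing the explicit list comprehension
        return [sample.sample_idx for sample in self.samples]
```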

Collaborator

Why do we need sample.sample_idx? This is not on current develop.


Labels

infra (Issues related to infrastructure), model (Related to model training or definition, not generic infra)

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

validation_io can be simplified

3 participants