Sophiex/dev/feat qk rmsnorm by sophie-xhonneux · Pull Request #2033 · ecmwf/WeatherGenerator

sophie-xhonneux · 2026-03-10T18:04:08Z

Description

Improves performance; see https://gitlab.jsc.fz-juelich.de/hedgedoc/OgGOfs2-RVOZcEEjB0108w?view
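For readers unfamiliar with the change, here is a minimal sketch of what RMSNorm applied to queries and keys does. It is not taken from this PR's diff: the function names `rms_norm` and `qk_logit`, the `eps` value, and the use of plain Python lists in place of torch tensors are all assumptions made to keep the example dependency-free. The per-feature `weight` plays the role of the learnable parameter mentioned in the commit history.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Rescale x to unit root-mean-square, then apply a per-feature gain.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def qk_logit(q, k, q_weight, k_weight):
    """Attention logit for one query/key pair with RMSNorm on both (hypothetical helper)."""
    qn = rms_norm(q, q_weight)
    kn = rms_norm(k, k_weight)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))


q = [3.0, 4.0]
k = [1.0, 2.0]
ones = [1.0, 1.0]  # gain initialised to 1, as is common for norm layers

base = qk_logit(q, k, ones, ones)
scaled = qk_logit([10.0 * v for v in q], k, ones, ones)

# Rescaling the raw query leaves the logit (almost) unchanged,
# which is the stabilising effect QK normalisation is used for.
print(abs(base - scaled) < 1e-6)
```

Because the normalisation removes the magnitude of `q` and `k`, the only remaining control over logit scale is the gain, which is why it matters whether that parameter is learnable.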

Issue Number

Closes #2032

Is this PR a draft? Mark it as draft.

Checklist before asking for review

I have performed a self-review of my code
My changes comply with basic sanity checks:
- I have fixed formatting issues with ./scripts/actions.sh lint
- I have run unit tests with ./scripts/actions.sh unit-test
- I have documented my code and I have updated the docstrings.
- I have added unit tests, if relevant
I have tried my changes with data and code:
- I have run the integration tests with ./scripts/actions.sh integration-test
- (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
- (bigger changes and experiments) I have shared a hedgedoc in the github issue with all the configurations and runs for these experiments
I have informed and aligned with people impacted by my change:
- for config changes: the MatterMost channels and/or a design doc
- for changes of dependencies: the MatterMost software development channel

clessig · 2026-03-15T19:04:40Z

    return config


+def _check_qk_norm_type(config: Config) -> Config:


This backfilling leads to problems elsewhere. We can just use cf.get( ...)

This was necessary I think for loading the teacher model somehow. I forget the details

Remove. This should not be added here. _check_logging above led to all kinds of problems lately. Using cf.get("config.qk_norm_type", "LayerNorm") is much more robust and useful.
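The difference between the two approaches discussed in this thread can be sketched as follows. This is an illustration under assumptions, not code from the repository: a plain `dict` stands in for the project's `Config` type, the key is shortened to `qk_norm_type`, and both function names are hypothetical.

```python
def backfill_qk_norm_type(cf: dict) -> dict:
    """Pattern the review asks to remove: mutate the config to add a missing key."""
    cf.setdefault("qk_norm_type", "LayerNorm")
    return cf


def resolve_qk_norm_type(cf: dict) -> str:
    """Suggested pattern: read with a default at the point of use, leaving cf untouched."""
    return cf.get("qk_norm_type", "LayerNorm")


# A config written before the option existed (e.g. one saved with an older model).
old_cf = {"num_heads": 8}

# Reading with a default works and does not change the stored config.
norm_type = resolve_qk_norm_type(old_cf)
untouched = "qk_norm_type" not in old_cf

# Backfilling also works, but the config now differs from what was saved.
backfilled = backfill_qk_norm_type(dict(old_cf))
```

With the default applied at the read site, older configs keep loading without a separate check function, and the config object is never silently rewritten.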

clessig · 2026-03-15T19:07:51Z

        return tokens_global_c


+class Local2GlobalSumEngine(torch.nn.Module):


Separate PR

Assuming the PR with this change is merged first

Which PR is this?

clessig

Thanks, PR looks good (has been rebased to latest develop). Can be merged with two changes:

- Remove the sum aggregation engine (or merge a clean PR with this before)
- Move the fix(?) / change to plot_utils.py to a separate PR; also see my comment there

@shmh40: can you approve and merge?

clessig · 2026-03-29T12:19:33Z

    return config


+def _check_qk_norm_type(config: Config) -> Config:


Remove. This should not be added here. _check_logging above led to all kinds of problems lately. Using cf.get("config.qk_norm_type", "LayerNorm") is much more robust and useful.

clessig · 2026-03-29T14:22:33Z

        return tokens_global_c


+class Local2GlobalSumEngine(torch.nn.Module):


Which PR is this?

shmh40

All good with me.

Sophie Xhonneux and others added 18 commits February 18, 2026 19:11

Write first solution with Claude

d6d51da

Add test configs, works on santis

a6e4c25

Merge branch 'develop' of github.com:ecmwf/WeatherGenerator into sophiex/dev/warm-and-frozen-teachers

e65241a

Merge branch 'develop' into sophiex/dev/warm-and-frozen-teachers

66de4c0

Merge branch 'develop' of github.com:ecmwf/WeatherGenerator into sophiex/dev/warm-and-frozen-teachers

9fbe081

Disabling rope; removing model config from finetuning since it needs to be (at least here) identical

af9a02c

Merge branch 'develop' into sophiex/dev/warm-and-frozen-teachers

da23d8a

Add new JEPA config

1ccad2b

Commit experiment

019357a

Add alternative Local2Global + fix grad norm comp

d6d1ebb

Update configs and plotting configs

90213f8

Merge branch 'sophiex/dev/exp-week8-frozen-sweep-mask-lr' into sophiex/dev/exp-week9-local2global

d11ca16

Test new engine

81c847b

Try RMSNorm for the QK norm, important caveat this has a learnable parameter for us

a5c973f

Add missing changes

7b4987d

Add configs

85b8d68

Try long runs

e35a92d

Delete superfluous configs

3588278

github-project-automation bot added this to WeatherGen-dev Mar 10, 2026

github-actions bot added the eval, infra, and model labels Mar 10, 2026

clessig reviewed Mar 15, 2026

sophie-xhonneux added the model:pretrain label Mar 26, 2026

clessig added 4 commits March 29, 2026 16:11

Merge branch 'develop' of https://github.com/ecmwf/WeatherGenerator into sophiex/dev/feat-qk-rmsnorm

ff301b5

Removed dependency on evaluation package

810f0b5

Removed stale code

5a24357

Linting

47328a8

clessig reviewed Mar 29, 2026

shmh40 self-requested a review April 1, 2026 13:59

shmh40 approved these changes Apr 1, 2026

Clean up

c630a3a

sophie-xhonneux merged commit f42e906 into develop Apr 10, 2026
5 checks passed

github-project-automation bot moved this to Done in WeatherGen-dev Apr 10, 2026

3 participants