[2370][model] Make the EncoderModule self-contained by Tewson1 · Pull Request #2372 · ecmwf/WeatherGenerator

Tewson1 · 2026-05-18T14:35:48Z

Description

Moves all HealPix-level dependent postitional encoding parameters out of ModelParams and into EncoderModule directly, making the EncoderModule self-contained. Previously pe_global, q_cells_lens, rope_coords, and rope_cell_coords lived in a shared ModelParams object that had to be passed into EncoderModule.forward at every call. This made it impossible to instantiate multiple encoders at different HEALPix resolutions

Issue Number

Closes #2370

CC: @clessig

Checklist before asking for review

I have performed a self-review of my code
My changes comply with basic sanity checks:
- I have fixed formatting issues with ./scripts/actions.sh lint
- I have run unit tests with ./scripts/actions.sh unit-test
- I have documented my code and I have updated the docstrings.
- I have added unit tests, if relevant
I have tried my changes with data and code:
- I have run the integration tests with ./scripts/actions.sh integration-test
- (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
- (bigger changes and experiments) I have shared a hegdedoc in the github issue with all the configurations and runs for this experiments
I have informed and aligned with people impacted by my change:
- for config changes: the MatterMost channels and/or a design doc
- for changes of dependencies: the MatterMost software development channel

clessig

Thanks for the implementation. I had a very quick look and I think we need separate ROPE coords also for the forecasting now

clessig · 2026-05-18T14:55:50Z

            if without_grad:
                # Pushforward mode: advance tokens without grad; no decoding with torch.no_grad():
-                tokens = self.forecast_engine(tokens, step, model_params.rope_coords)
+                tokens = self.forecast_engine(tokens, step, self.encoder.rope_coords)


If we have multiple healpix levels on multiple decoders, this won't work. I think we need to have separate rope coords for the forecast engine in it.

Thanks for the review. Can you explain why we need seperate rope coords for the forecast engine. At the end, same long/lat points should have the same rope_coords or am I mixing something up?

We could have healpix level 5 and healpix level 7 for the input and combine them on a healpix level 6 latent grid. The forecast engine would operate on this one.

…ors to buffers

clessig · 2026-05-19T19:25:25Z

@Tewson1 : thanks for the hard work. Could you summarize which cases you have tested?

…n vectors/buffers like pe_embed

Tewson1 · 2026-05-20T10:31:08Z

@Tewson1 : thanks for the hard work. Could you summarize which cases you have tested?

I haven't really started testing. I have launched a job (32 epochs training, 16 epochs finetuning and also inference) to see if the training and model performance behaves the same as before (I also created this hedgedoc: https://gitlab.jsc.fz-juelich.de/hedgedoc/kSSjIW6gQ9OWL_c8ByWTJQ). The job failed last evening because of a bug in trainer.py. We have now buffers/plain vectors in state_dict and this triggers an error when it passes .full_tensor(). I fixed it using a simple if else ,maybe there is a more elegant way to approach this problem.

What are the other approaches to testing ? Thanks!

clessig · 2026-05-20T17:51:19Z

We should test with ERA5 and multiple datasets. We also need to check with SSL and latent diffusion but maybe thinking about these cases is enough.

Tewson1 · 2026-05-21T08:58:32Z

We should test with ERA5 and multiple datasets. We also need to check with SSL and latent diffusion but maybe thinking about these cases is enough.

@clessig Thanks for the response! Do you think that 32 epochs of training and 16 epochs of finetuning is enough.
If yes, then I would continue with the following run:

to cover multiple datasets (+ ERA5) I would use the config_forecasting.yml with the stream config era5_georing_avhrr

I was looking through the code and it seems that training_mode doesn't affect the encoder. So I am also not sure either. Otherwise I am considering a run where I use config_jepa,yml for pretraining and config_jepa_finetuning.yml for finetuning

clessig · 2026-05-21T09:43:38Z

and it seems that training_mode doesn't affect the encoder. So I am also not sure either. Otherwise I am considering a run where I use config_jepa,yml for pretraining and config_jepa_finetuning.yml for finetuning

32+16 is definitely enough. I think if it works mechanically then also convergence shouldn't work. But it would be important to test with multiple datasets and pretraining + finetuning (mainly to check model loading) and inference. Also run ./scripts/actions.sh integration-test.

Tharmeekan and others added 2 commits May 18, 2026 13:56

Refactor EncoderModule and ModelParams

0bf0662

Merge branch 'ecmwf:develop' into tsivam/dev/iss2370_encoder

fb9588b

github-project-automation Bot added this to WeatherGen-dev May 18, 2026

clessig self-requested a review May 18, 2026 14:53

clessig reviewed May 18, 2026

View reviewed changes

github-actions Bot added the model Related to model training or definition (not generic infra) label May 18, 2026

Add ROPE coords to ForecastingEngine and update reset_parameters method

a95bc9e

Tewson1 force-pushed the tsivam/dev/iss2370_encoder branch from 12e124c to a95bc9e Compare May 18, 2026 20:51

clean up code formatting in encoder, engines, and model modules

585fe84

Tewson1 force-pushed the tsivam/dev/iss2370_encoder branch from e5fddd5 to 585fe84 Compare May 19, 2026 09:44

Tharmeekan added 3 commits May 19, 2026 13:41

Merge branch 'develop' into tsivam/dev/iss2370_encoder

44ceaa1

Bug fixes

77e733b

Fix FSDP2 sharding bug by converting non-trainable EncoderModule tens…

11a6fb9

…ors to buffers

Add pipeline configuration for testing; update Trainer to handle plai…

1e91127

…n vectors/buffers like pe_embed

Fix load_model when plain tensor

987929e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[2370][model] Make the EncoderModule self-contained#2372

[2370][model] Make the EncoderModule self-contained#2372
Tewson1 wants to merge 9 commits into
ecmwf:developfrom
MeteoSwiss:tsivam/dev/iss2370_encoder

Tewson1 commented May 18, 2026 •

edited

Loading

Uh oh!

clessig left a comment

Uh oh!

clessig May 18, 2026

Uh oh!

Tewson1 May 18, 2026

Uh oh!

clessig May 18, 2026

Uh oh!

clessig commented May 19, 2026

Uh oh!

Tewson1 commented May 20, 2026

Uh oh!

clessig commented May 20, 2026

Uh oh!

Tewson1 commented May 21, 2026

Uh oh!

clessig commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Tewson1 commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Issue Number

Checklist before asking for review

Uh oh!

clessig left a comment

Choose a reason for hiding this comment

Uh oh!

clessig May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Tewson1 May 18, 2026

Choose a reason for hiding this comment

Uh oh!

clessig May 18, 2026

Choose a reason for hiding this comment

Uh oh!

clessig commented May 19, 2026

Uh oh!

Tewson1 commented May 20, 2026

Uh oh!

clessig commented May 20, 2026

Uh oh!

Tewson1 commented May 21, 2026

Uh oh!

clessig commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Tewson1 commented May 18, 2026 •

edited

Loading