Skip to content

[2370][model] Make the EncoderModule self-contained#2372

Draft
Tewson1 wants to merge 9 commits into
ecmwf:developfrom
MeteoSwiss:tsivam/dev/iss2370_encoder
Draft

[2370][model] Make the EncoderModule self-contained#2372
Tewson1 wants to merge 9 commits into
ecmwf:developfrom
MeteoSwiss:tsivam/dev/iss2370_encoder

Conversation

@Tewson1
Copy link
Copy Markdown

@Tewson1 Tewson1 commented May 18, 2026

Description

Moves all HealPix-level dependent postitional encoding parameters out of ModelParams and into EncoderModule directly, making the EncoderModule self-contained. Previously pe_global, q_cells_lens, rope_coords, and rope_cell_coords lived in a shared ModelParams object that had to be passed into EncoderModule.forward at every call. This made it impossible to instantiate multiple encoders at different HEALPix resolutions

Issue Number

Closes #2370

CC: @clessig

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hegdedoc in the github issue with all the configurations and runs for this experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

Copy link
Copy Markdown
Collaborator

@clessig clessig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the implementation. I had a very quick look and I think we need separate ROPE coords also for the forecasting now

Comment thread src/weathergen/model/model.py Outdated
if without_grad:
# Pushforward mode: advance tokens without grad; no decoding with torch.no_grad():
tokens = self.forecast_engine(tokens, step, model_params.rope_coords)
tokens = self.forecast_engine(tokens, step, self.encoder.rope_coords)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have multiple healpix levels on multiple decoders, this won't work. I think we need to have separate rope coords for the forecast engine in it.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. Can you explain why we need seperate rope coords for the forecast engine. At the end, same long/lat points should have the same rope_coords or am I mixing something up?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have healpix level 5 and healpix level 7 for the input and combine them on a healpix level 6 latent grid. The forecast engine would operate on this one.

@github-actions github-actions Bot added the model Related to model training or definition (not generic infra) label May 18, 2026
@Tewson1 Tewson1 force-pushed the tsivam/dev/iss2370_encoder branch from 12e124c to a95bc9e Compare May 18, 2026 20:51
@Tewson1 Tewson1 force-pushed the tsivam/dev/iss2370_encoder branch from e5fddd5 to 585fe84 Compare May 19, 2026 09:44
@clessig
Copy link
Copy Markdown
Collaborator

clessig commented May 19, 2026

@Tewson1 : thanks for the hard work. Could you summarize which cases you have tested?

@Tewson1
Copy link
Copy Markdown
Author

Tewson1 commented May 20, 2026

@Tewson1 : thanks for the hard work. Could you summarize which cases you have tested?

I haven't really started testing. I have launched a job (32 epochs training, 16 epochs finetuning and also inference) to see if the training and model performance behaves the same as before (I also created this hedgedoc: https://gitlab.jsc.fz-juelich.de/hedgedoc/kSSjIW6gQ9OWL_c8ByWTJQ). The job failed last evening because of a bug in trainer.py. We have now buffers/plain vectors in state_dict and this triggers an error when it passes .full_tensor(). I fixed it using a simple if else ,maybe there is a more elegant way to approach this problem.

What are the other approaches to testing ? Thanks!

@clessig
Copy link
Copy Markdown
Collaborator

clessig commented May 20, 2026

We should test with ERA5 and multiple datasets. We also need to check with SSL and latent diffusion but maybe thinking about these cases is enough.

@Tewson1
Copy link
Copy Markdown
Author

Tewson1 commented May 21, 2026

We should test with ERA5 and multiple datasets. We also need to check with SSL and latent diffusion but maybe thinking about these cases is enough.

@clessig Thanks for the response! Do you think that 32 epochs of training and 16 epochs of finetuning is enough.
If yes, then I would continue with the following run:

  • to cover multiple datasets (+ ERA5) I would use the config_forecasting.yml with the stream config era5_georing_avhrr

I was looking through the code and it seems that training_mode doesn't affect the encoder. So I am also not sure either. Otherwise I am considering a run where I use config_jepa,yml for pretraining and config_jepa_finetuning.yml for finetuning

@clessig
Copy link
Copy Markdown
Collaborator

clessig commented May 21, 2026

and it seems that training_mode doesn't affect the encoder. So I am also not sure either. Otherwise I am considering a run where I use config_jepa,yml for pretraining and config_jepa_finetuning.yml for finetuning

32+16 is definitely enough. I think if it works mechanically then also convergence shouldn't work. But it would be important to test with multiple datasets and pretraining + finetuning (mainly to check model loading) and inference. Also run ./scripts/actions.sh integration-test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

model Related to model training or definition (not generic infra)

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Make the encoder self-contained

2 participants