[1050] Upgrade to torch 2.9.1 by florianscheidl · Pull Request #2310 · ecmwf/WeatherGenerator

florianscheidl · 2026-05-05T10:05:27Z

Description

We updated to torch 2.9.1 for GPU systems (still on CUDA 12.6, as before).

This should resolve some bugs (e.g., the profiling memory bug) and allow us to use more up-to-date packages that rely on newer torch versions.
We set exclude-newer = "2026-04-27T00:00:00Z".
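A quick way to confirm that an environment actually picked up the intended build is to inspect the torch version string. The sketch below is only illustrative: the helper names `parse_torch_version` and `is_expected_build` are hypothetical, and it assumes the wheel reports a local tag such as `+cu126`, which may not hold for every platform or index.

```python
from __future__ import annotations

import re


def parse_torch_version(version: str) -> tuple[tuple[int, ...], str | None]:
    """Split a string such as '2.9.1+cu126' into ((2, 9, 1), 'cu126')."""
    release, _, local = version.partition("+")
    # Collect the numeric components of the release part.
    numbers = tuple(int(part) for part in re.findall(r"\d+", release))
    return numbers, (local or None)


def is_expected_build(
    version: str,
    release: tuple[int, int, int] = (2, 9, 1),
    cuda_tag: str = "cu126",
) -> bool:
    """Return True if the version matches the expected release and CUDA tag."""
    numbers, local = parse_torch_version(version)
    return numbers[:3] == release and local == cuda_tag


# Hypothetical version strings, used here only for illustration.
print(is_expected_build("2.9.1+cu126"))
print(is_expected_build("2.9.1+cu129"))
```

On a GPU node, one could pass `torch.__version__` to `is_expected_build` and additionally look at `torch.version.cuda`, which reports the CUDA version the installed wheel was built against.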
We ran one training before and after the changes on the default config to check the performance impact:

../WeatherGenerator-private/hpc/launch-slurm.py --time 10

Jupiter (before: leqmx1by, 360 samples; after: aehsu1ba, 380 samples)
Santis (before: w05m9oir, 330 samples; after: jdltacwy, 340 samples)
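For reference, the relative change in sample counts can be computed as below. This is a rough sketch that assumes the before and after runs used the same time budget, so that the counts are directly comparable; with a single run per configuration, the numbers say nothing about run-to-run variance.

```python
def relative_change(before: int, after: int) -> float:
    """Relative change in percent when going from `before` to `after`."""
    return 100.0 * (after - before) / before


# Sample counts quoted above: (before, after) per cluster.
runs = {
    "Jupiter": (360, 380),
    "Santis": (330, 340),
}

for cluster, (before, after) in runs.items():
    change = relative_change(before, after)
    print(f"{cluster}: {before} -> {after} samples ({change:+.1f}%)")
```

Both clusters come out slightly higher after the upgrade (roughly +5.6% on Jupiter and +3.0% on Santis).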

Moreover, we checked that Juwels and HPC2020 support the newer NVIDIA GPU drivers (in fact, they would all support CUDA 13.1). Leonardo only supports up to CUDA 12.6.

The training runs show neither errors nor performance degradation from the version bump.

Note: We are deliberately not upgrading to the most recent torch/CUDA versions our hardware supports, because the most recent flash attention release only supports up to torch 2.9 and CUDA 12.*, i.e., the versions we chose here. We are investigating flash-attn-4 (see #2181), but as that is still in progress, we are upgrading step by step.
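The version choice can be read as two constraints: the lowest CUDA ceiling among the clusters, and the flash attention limit on the CUDA major version. The sketch below illustrates that reasoning only. The function names are hypothetical, and the table uses just the ceilings stated above, reading "13.1" as the ceiling for Juwels and HPC2020.

```python
from __future__ import annotations


def parse_version(version: str) -> tuple[int, int]:
    """Turn a 'major.minor' string such as '12.6' into (12, 6)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def common_cuda_ceiling(cluster_max: dict[str, str]) -> tuple[int, int]:
    """Highest CUDA version that every listed cluster supports."""
    return min(parse_version(v) for v in cluster_max.values())


def within_flash_attn_limit(cuda: tuple[int, int], max_major: int = 12) -> bool:
    """True if the CUDA major version is within the flash attention limit."""
    return cuda[0] <= max_major


# Maximum CUDA versions per cluster, as stated in this description.
cluster_max_cuda = {"Juwels": "13.1", "HPC2020": "13.1", "Leonardo": "12.6"}

ceiling = common_cuda_ceiling(cluster_max_cuda)
print(ceiling, within_flash_attn_limit(ceiling))
```

With these inputs the common ceiling is CUDA 12.6, set by Leonardo, which also falls within the 12.* range supported by flash attention.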

Issue Number

Closes #1050.

Checklist before asking for review

I have performed a self-review of my code
My changes comply with basic sanity checks:
- I have fixed formatting issues with ./scripts/actions.sh lint
- I have run unit tests with ./scripts/actions.sh unit-test
- I have documented my code and I have updated the docstrings.
- I have added unit tests, if relevant
I have tried my changes with data and code:
- I have run the integration tests with ./scripts/actions.sh integration-test
- (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
- (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
I have informed and aligned with people impacted by my change:
- for config changes: the MatterMost channels and/or a design doc
- for changes of dependencies: the MatterMost software development channel

florianscheidl · 2026-05-22T14:12:02Z

@clessig, I think we're ready to merge this early next week. I've tested it on Jupiter, Santis, and Leonardo.

florianscheidl and others added 2 commits May 5, 2026 11:37

upgrade pytorch and flashattn wip

7f67a8e

update download paths

dc6d05b

github-project-automation Bot added this to WeatherGen-dev May 5, 2026

florianscheidl and others added 5 commits May 5, 2026 12:06

Merge branch 'develop' into flo-upgrade-torch-2.9-cu129

cce677e

Add Macos support

5a6ea7b

Merge branch 'flo-upgrade-torch-2.9-cu129' of github.com:florianschei…

9fac59c

…dl/WeatherGenerator into flo-upgrade-torch-2.9-cu129

remove macos specific support

be79b44

downgrade to cu126 again

2334509

florianscheidl changed the title ~~[1050] Upgrade to torch 2.9 and cu129~~ [1050] Upgrade to torch 2.9.1 and cu126 May 5, 2026

florianscheidl marked this pull request as ready for review May 6, 2026 07:44

florianscheidl changed the title ~~[1050] Upgrade to torch 2.9.1 and cu126~~ [1050] Upgrade to torch 2.9.1 May 6, 2026

florianscheidl and others added 6 commits May 6, 2026 15:30

Specify build dependency to avoid dependency issues on specific wheels

28c3ee9

Revert "downgrade to cu126 again"

12838d4

This reverts commit 2334509.

Merge branch 'flo-upgrade-torch-2.9-cu129' of github.com:florianschei…

2e7a301

…dl/WeatherGenerator into flo-upgrade-torch-2.9-cu129

Reset by hand

e62d52b

Trying pinned ABIfalse for x_86

e249a03

Merge branch 'develop' into flo-upgrade-torch-2.9-cu129

bc65b2e

florianscheidl wants to merge 13 commits into ecmwf:develop from florianscheidl:flo-upgrade-torch-2.9-cu129
