
[1050] Upgrade to torch 2.9.1#2310

Open
florianscheidl wants to merge 13 commits into ecmwf:develop from florianscheidl:flo-upgrade-torch-2.9-cu129

@florianscheidl
Contributor

@florianscheidl florianscheidl commented May 5, 2026

Description

We updated to torch 2.9.1 for GPU systems, staying on CUDA 12.6 as before.

  • This should resolve some bugs (e.g., profiling memory bug) and allow us to use more up-to-date packages relying on newer torch versions.
  • We set exclude-newer = "2026-04-27T00:00:00Z".
  • Tested one training run before and after the changes on the default config to check the performance impact:
../WeatherGenerator-private/hpc/launch-slurm.py --time 10 
  1. Jupiter (before: leqmx1by, 360 samples; after: aehsu1ba, 380 samples)
  2. Santis (before: w05m9oir, 330 samples; after: jdltacwy, 340 samples)
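
For reference, the before/after sample counts listed above can be turned into relative changes. The following is a minimal sketch using only the numbers quoted in this description; the `runs` dictionary and the `relative_change` helper are illustrative names, not part of the repository. Each configuration was run once, so the differences may be within run-to-run noise.

```python
# Sample counts quoted in the PR description for runs launched with --time 10,
# before and after the torch upgrade.
runs = {
    "Jupiter": {"before": 360, "after": 380},
    "Santis": {"before": 330, "after": 340},
}


def relative_change(before: int, after: int) -> float:
    """Return the change from `before` to `after` in percent."""
    return 100.0 * (after - before) / before


for system, counts in runs.items():
    change = relative_change(counts["before"], counts["after"])
    print(f"{system}: {counts['before']} -> {counts['after']} samples ({change:+.1f}%)")
```

Both systems processed slightly more samples after the upgrade (roughly 5.6% on Jupiter and 3.0% on Santis).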

Moreover, we checked that Juwels and HPC2020 support the newer NVIDIA GPU drivers (in fact, they would all support CUDA 13.1). Leonardo supports only up to CUDA 12.6.

From the training runs, we see neither errors nor performance degradation from the version bump.

Note: We are deliberately not upgrading to the most recent torch/CUDA versions our hardware supports, because the latest flash attention release supports only up to torch 2.9 and CUDA 12.*, i.e., the versions we choose here. We are investigating flash-attn-4 (see #2181), but as this is still in progress, we are upgrading stepwise.
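
The constraint in the note can be written down as a small check. This is a sketch only: the upper bounds are copied from the note above rather than read from flash attention itself, and `parse_version` and `within_flash_attn_bounds` are hypothetical helper names introduced for illustration.

```python
import re

# Upper bounds as stated in the note above (assumption: not queried from
# flash attention, just transcribed from the PR description).
MAX_TORCH = (2, 9)
MAX_CUDA_MAJOR = 12


def parse_version(version: str) -> tuple[int, ...]:
    """Extract the leading numeric components, e.g. '2.9.1+cu126' -> (2, 9, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def within_flash_attn_bounds(torch_version: str, cuda_version: str) -> bool:
    """Check a torch/CUDA pair against the bounds stated in the note."""
    torch_major_minor = parse_version(torch_version)[:2]
    cuda_major = parse_version(cuda_version)[0]
    return torch_major_minor <= MAX_TORCH and cuda_major <= MAX_CUDA_MAJOR


print(within_flash_attn_bounds("2.9.1", "12.6"))  # versions chosen in this PR
print(within_flash_attn_bounds("2.9.1", "13.1"))  # CUDA 13.1 exceeds the stated bound
```

In a live environment, the two arguments could be taken from `torch.__version__` and `torch.version.cuda` (the latter is `None` on CPU-only builds, so it would need a guard).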

Issue Number

Closes #1050.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@florianscheidl florianscheidl changed the title from "[1050] Upgrade to torch 2.9 and cu129" to "[1050] Upgrade to torch 2.9.1 and cu126" on May 5, 2026
@florianscheidl florianscheidl marked this pull request as ready for review May 6, 2026 07:44
@florianscheidl florianscheidl changed the title from "[1050] Upgrade to torch 2.9.1 and cu126" to "[1050] Upgrade to torch 2.9.1" on May 6, 2026
@florianscheidl
Contributor Author

@clessig, I think we're ready to merge this early next week. I've tested it on Jupiter, Santis, and Leonardo.

Development

Successfully merging this pull request may close these issues.

Update to pytorch 2.7+