
Conversation

@sophie-xhonneux (Contributor)

Description

See PR #1590

Issue Number

Closes #1587

Is this PR a draft? Mark it as draft.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and written the run_id(s) in a comment: launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@github-actions bot added the model label (Related to model training or definition (not generic infra)) on Jan 17, 2026
@clessig (Collaborator) left a comment:

trainer.py and trainer_base.py need to be cleaned up, please.

# currently fixed to 1.0 (due to limitations with flex_attention and triton)
forecast_att_dense_rate: 1.0

sslpred_num_blocks: 12
Collaborator:

I don't think these params should go into the model block; they should be part of the SSL loss term. This is really also JEPA-specific.

Contributor:

Sophie took care of this

self.pred_blocks = nn.ModuleList()

# first map to intermediate_dim to introduce a bottleneck
self.pred_blocks.append(nn.Linear(in_dim, intermediate_dim, bias=False))
Collaborator:

We should call this blocks in all modules.

Contributor:

done
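For context, a minimal self-contained sketch of the bottleneck pattern described in the snippet above: project down to intermediate_dim, run a stack of blocks at that width, then project back out. This is my reading of the intended shape, not the repo's actual code; nn.TransformerEncoderLayer stands in for the project's own block type, and the dimensions are placeholders.

import torch
import torch.nn as nn

in_dim, intermediate_dim, out_dim, num_blocks = 1024, 384, 1024, 12

blocks = nn.ModuleList()
# project down to the bottleneck dimension
blocks.append(nn.Linear(in_dim, intermediate_dim, bias=False))
# stack of blocks operating at the bottleneck width
for _ in range(num_blocks):
    blocks.append(
        nn.TransformerEncoderLayer(d_model=intermediate_dim, nhead=6, batch_first=True)
    )
# project back up to the output dimension
blocks.append(nn.Linear(intermediate_dim, out_dim, bias=False))

x = torch.randn(2, 16, in_dim)  # [Batch, Tokens, Dim]
for block in blocks:
    x = block(x)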

# we concatenate the patch and class tokens to process them together
# We concatenate in the token dimension [Batch, Tokens, Dim]
patch_class_tokens = []
if self.class_token:
Collaborator:

We should use x.class_token and x.patch_token here.

Contributor:

I renamed these to

self.use_class_token = use_class_token
self.use_patch_token = use_patch_token

to clarify that these are booleans. Now it looks like this in forward():

if self.use_class_token:
   patch_class_tokens.append(x.class_token)
if self.use_patch_token:
   patch_class_tokens.append(x.patch_tokens)
patch_class_tokens = torch.cat(patch_class_tokens, dim=1)

Collaborator:

Better, although I think it's still duplicated, and using x.class_token and x.patch_token is more robust.
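For illustration, a minimal sketch of that suggestion, assuming x exposes class_token and patch_tokens attributes that are None when absent (TokenSet here is a hypothetical container, not necessarily the repo's actual API):

from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class TokenSet:
    # hypothetical stand-in for the x passed into forward()
    class_token: Optional[torch.Tensor] = None
    patch_tokens: Optional[torch.Tensor] = None


x = TokenSet(class_token=torch.randn(2, 1, 64), patch_tokens=torch.randn(2, 16, 64))

patch_class_tokens = []
if x.class_token is not None:
    patch_class_tokens.append(x.class_token)
if x.patch_tokens is not None:
    patch_class_tokens.append(x.patch_tokens)
# concatenate along the token dimension [Batch, Tokens, Dim]
patch_class_tokens = torch.cat(patch_class_tokens, dim=1)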

if isinstance(block, torch.nn.modules.normalization.LayerNorm):
    patch_class_tokens = block(patch_class_tokens)
else:
    patch_class_tokens = checkpoint(block, patch_class_tokens, use_reentrant=False)
Collaborator:

The checkpoint should be on a coarser level if possible.

Contributor:

I did not understand this comment :)

Collaborator:

The checkpoint should wrap the call to this forward function, not be in the forward function
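For illustration, a minimal self-contained sketch of that distinction: the caller wraps the whole forward call in checkpoint, while the module's forward() stays a plain loop over its blocks. Predictor here is a hypothetical stand-in, not the code under review.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Predictor(nn.Module):
    # hypothetical stand-in for the predictor module under review
    def __init__(self, dim: int, num_blocks: int):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_blocks))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # no checkpointing inside forward(); just run the blocks
        for block in self.blocks:
            x = block(x)
        return self.norm(x)


predictor = Predictor(dim=64, num_blocks=4)
tokens = torch.randn(2, 16, 64, requires_grad=True)  # [Batch, Tokens, Dim]
# the caller decides whether to checkpoint and wraps the whole call
out = checkpoint(predictor, tokens, use_reentrant=False)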


self.patch_token = patch_token
# For now this is a Linear Layer TBD what this architecture should be
self.layer = nn.Linear(in_dim, out_dim, bias=False)
self.layer = MLP(in_dim, out_dim, num_layers, hidden_factor)
Collaborator:

self.layer -> self.blocks

Contributor:

done

print("Happy to be here")
batch.to_device(self.device)

print("Batch to device")
Collaborator:

Remove

self.training_cfg.window_offset_prediction,
)

print("Model predictions")
Collaborator:

Remove

Contributor:

done

self.model,
self.training_cfg.window_offset_prediction,
)
print("target predictions")
Collaborator:

Remove

Contributor:

done

targets_and_aux=targets_and_auxs,
metadata=extract_batch_metadata(batch),
)
print("loss calcuclation")
Collaborator:

Remove

Contributor:

done.

Removed more print() statements.

return torch.device("cpu")

local_id_node = os.environ.get("SLURM_LOCALID", "-1")
local_id_node = os.environ.get("LOCAL_RANK", os.environ.get("SLURM_LOCALID", "-1"))
Collaborator:

This should be torch.distributed.get_local_rank(), unless someone can explain why it's not suitable here.

Contributor:

I am not sure why this even changed...

Using dist.get_node_local_rank(fallback_rank=-1) instead.
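For reference, a minimal sketch of that change, assuming a PyTorch version where torch.distributed.get_node_local_rank is available:

import torch.distributed as dist

# prefer the torch.distributed helper over reading env vars directly;
# fall back to -1, as before, when no local-rank information is set
local_id_node = dist.get_node_local_rank(fallback_rank=-1)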


Development

Successfully merging this pull request may close these issues.

Implement transformer based predictors for JEPA
