Conversation
Hi @SarahAlidoost and @meiertgrootes, I created an example training process on a subset of the two-year data, and ran it on Levante. In this PR I included an example SLURM training process, with a README on how to configure the jobs on Levante. A copy of the example run can be found on
SarahAlidoost
left a comment
@rogerkuou thanks for the script. Since PR #29 fixed a few issues, we need to merge main into this branch. I also left some comments, mainly about the structure of example.py and the code that should be run with SLURM. If something is unclear, please let me know. In the meantime, I will work on issue #33.
    return ds[["ts"]].sel(lon=lon_subset, lat=lat_subset)

def main():
This function is currently doing a lot: creating the model, training it, making predictions, and saving results. We should split these responsibilities.
If this is a "training script", it should only handle reading the data, creating the model with the correct arguments, and passing both to a separate training function (that will be added in #33).
Then, in another script (e.g. "inference script"), we can load the saved model and make predictions (see #32). This separation is needed because training and inference require different computing resources.
Any plotting or result inspection can be done in a separate script if needed.
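As an illustration of the separation being proposed here, a minimal, framework-free sketch (the function names `train_model` and `run_inference` are hypothetical, not the project's API):

```python
# Hypothetical sketch of the proposed script split: the training script only
# fits and saves a model, the inference script only loads and predicts.

def train_model(model, batches, step_fn):
    """Training-script responsibility: iterate over data, update the model."""
    for batch in batches:
        step_fn(model, batch)  # one optimization step per batch
    return model

def run_inference(model, batches, predict_fn):
    """Inference-script responsibility: take a trained model, make predictions."""
    return [predict_fn(model, batch) for batch in batches]
```

Keeping the two entry points separate also makes it straightforward to request different SLURM resources (e.g. GPU for training, CPU-only for inference).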
I have split this into a training script and an inference script. The plotting part has been removed.
scripts/example.py
Outdated
lon_subset = slice(-10, 10)
lat_subset = slice(-5, 5)
Slicing should not be needed; we want to work with global data on HPC.
# Compute monthly climatology stats without persisting the full (time, lat, lon) monthly field
monthly_ts = daily_data["ts"].resample(time="MS").mean(skipna=True)
mean = monthly_ts.mean(dim=["lat", "lon"], skipna=True).compute().values
std = monthly_ts.std(dim=["lat", "lon"], skipna=True).compute().values
print(f"mean: {mean}, std: {std}")

# Make a dataset
dataset = STDataset(
    daily_da=daily_data["ts"],
    monthly_da=monthly_data["ts"],
    land_mask=lsm_mask["lsm"],
    patch_size=(patch_size_training, patch_size_training),
)
All these lines should be moved to the training script in #33.
scripts/example.py
Outdated
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
decoder = model.decoder
with torch.no_grad():
    decoder.bias.copy_(torch.from_numpy(mean))
    decoder.scale.copy_(torch.from_numpy(std) + 1e-6)

# Make a dataloader
dataloader = DataLoader(
    dataset,
    batch_size=1,
    shuffle=True,
    pin_memory=False,
)

# Training process
best_loss = float("inf")
patience = 10
counter = 0
model.train()
for epoch in range(101):
    for batch in dataloader:
        optimizer.zero_grad()

        daily_batch = batch["daily_patch"]
        daily_mask = batch["daily_mask_patch"]
        monthly_target = batch["monthly_patch"]
        land_mask_patch = batch["land_mask_patch"][0, ...]
        padded_days_mask = batch["padded_days_mask"]

        pred = model(daily_batch, daily_mask, land_mask_patch, padded_days_mask)

        ocean = (~land_mask_patch).to(pred.device)
        ocean = ocean[None, None, :, :]

        loss = (
            torch.nn.functional.l1_loss(pred, monthly_target, reduction="none")
            * ocean
        )
        loss_per_month = loss.sum(dim=(-2, -1)) / ocean.sum(dim=(-2, -1))
        loss = loss_per_month.mean()

        loss.backward()
        optimizer.step()

    if loss.item() < best_loss:
        best_loss = loss.item()
        counter = 0

        if epoch % 20 == 0:
            print(f"The loss is {best_loss} at epoch {epoch}")
    else:
        counter += 1
        if counter >= patience:
            print(
                f"No improvement for {patience} epochs, stopping early at epoch {epoch}."
            )
            break

print("training done!")
print(f"Final loss: {loss.item()}")
All these lines should be moved to the training script in #33.
Agree. I will leave this to another PR.
scripts/example.py
Outdated
# Calculate prediction and error
dataset_pred = STDataset(
    daily_da=daily_data["ts"],
    monthly_da=monthly_data["ts"],
    land_mask=lsm_mask["lsm"],
    patch_size=(daily_data.sizes["lat"], daily_data.sizes["lon"]),
)
dataloader_pred = DataLoader(
    dataset_pred,
    batch_size=len(dataset_pred),
    pin_memory=False,
)
full_batch = next(iter(dataloader_pred))
daily_batch = full_batch["daily_patch"]
daily_mask = full_batch["daily_mask_patch"]
monthly_target = full_batch["monthly_patch"]
land_mask_patch = full_batch["land_mask_patch"][0, ...]
padded_days_mask = full_batch["padded_days_mask"]
model.eval()
with torch.no_grad():
    pred = model(daily_batch, daily_mask, land_mask_patch, padded_days_mask)
monthly_prediction = pred_to_numpy(pred, land_mask=land_mask_patch)[0]
monthly_data["ts_pred"] = (("time", "lat", "lon"), monthly_prediction)
Inference should be done in a separate script.
scripts/example.py
Outdated
# Save the trained model
model_save_path = Path("./models/spatio_temporal_model.pth")
model_save_path.parent.mkdir(parents=True, exist_ok=True)
torch.save(model.state_dict(), model_save_path)
If we want to load the model later, we need to know how to create the model instance. Therefore, it is better to save the model config (arguments and defaults) with the model.
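For illustration, one way this could look (a sketch, not the project's implementation; the `model.config` attribute and the checkpoint layout are assumptions, and a `torch.nn.Linear` stands in for the real model):

```python
# Sketch: bundle the constructor arguments with the weights, so the model
# can be re-created at load time without knowing the config out of band.
import tempfile
from pathlib import Path

import torch

model = torch.nn.Linear(4, 2)  # stand-in for SpatioTemporalModel
model.config = {"in_features": 4, "out_features": 2}  # hypothetical attribute

save_path = Path(tempfile.mkdtemp()) / "model.pth"
torch.save({"config": model.config, "state_dict": model.state_dict()}, save_path)

# Later, e.g. in the inference script:
checkpoint = torch.load(save_path)
restored = torch.nn.Linear(**checkpoint["config"])
restored.load_state_dict(checkpoint["state_dict"])
```

Keeping config and weights in one file avoids the two drifting apart between training and inference runs.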
scripts/example.py
Outdated
# Save the xr.Dataset with predictions
predictions_save_path = Path("./predicted_data/predictions.nc")
predictions_save_path.parent.mkdir(parents=True, exist_ok=True)
monthly_data.to_netcdf(predictions_save_path)
print(f"Saved model to: {model_save_path}")
print(f"Saved predictions to: {predictions_save_path}")
These should be moved to the inference script.
Please don't use print statements in a SLURM job. That information should probably be logged to a log file.
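As a sketch of what that could look like with Python's standard `logging` module (the file location here is hypothetical; note that in a SLURM job, stdout/stderr are already redirected to the file given by `#SBATCH --output`, so logging to stderr can also be enough):

```python
# Minimal file-logging setup (sketch); replaces print() calls in the script.
import logging
import tempfile
from pathlib import Path

log_path = Path(tempfile.mkdtemp()) / "train.log"  # hypothetical location
logging.basicConfig(
    filename=log_path,  # omit `filename` to log to stderr instead
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    force=True,  # replace any handlers configured earlier in the process
)
logger = logging.getLogger(__name__)
logger.info("Saved predictions to: predictions.nc")
```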
scripts/example.py
Outdated
# Plot and save inspections
plot_path = Path("./figures/")  # local
plot_path.mkdir(parents=True, exist_ok=True)
# 1) Prediction (t=0)
fig, ax = plt.subplots(figsize=(8, 4))
monthly_data["ts_pred"].isel(time=0).plot(ax=ax)
fig.savefig(plot_path / "ts_pred_t0.png", dpi=200, bbox_inches="tight")
plt.close(fig)

# 2) Target (t=0)
fig, ax = plt.subplots(figsize=(8, 4))
monthly_data["ts"].where(~lsm_mask["lsm"].values).isel(time=0).plot(ax=ax)
fig.savefig(plot_path / "ts_target_t0.png", dpi=200, bbox_inches="tight")
plt.close(fig)

# 3) Error (t=0)
fig, ax = plt.subplots(figsize=(8, 4))
err.isel(time=0).plot(ax=ax)
fig.savefig(plot_path / "err_t0.png", dpi=200, bbox_inches="tight")
plt.close(fig)

# 4) Error (t=1)
fig, ax = plt.subplots(figsize=(8, 4))
err.isel(time=1).plot(ax=ax)
fig.savefig(plot_path / "err_t1.png", dpi=200, bbox_inches="tight")
plt.close(fig)
We don't need these in a training script. They can be done later if we have the model and the predictions saved on disk.
Co-authored-by: SarahAlidoost <55081872+SarahAlidoost@users.noreply.github.com>
Hi @SarahAlidoost, thanks for the review! I implemented most of your comments.
I did not implement the training utility function and will leave it to #33. Can you give it another look?
# Initialize training
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SpatioTemporalModel(
Can you please use the same arguments for the model as those in the example notebook in the main branch?
decoder.scale.copy_(torch.from_numpy(std) + 1e-6)

# Make a dataloader
dataloader = DataLoader(
Can you please use the same arguments for the dataloader as those in the example notebook in the main branch?
patience = 10
counter = 0
model.train()
for epoch in range(101):
Can you please use the same training loop as in the example notebook in the main branch?
counter = 0

if epoch % 20 == 0:
    logger.info(f"The loss is {best_loss} at epoch {epoch}")
Where will the logger info be stored, in the SLURM log file? 🤔
self.config = {
    'in_chans': in_chans,
    'embed_dim': embed_dim,
    'patch_size': patch_size,
    'max_days': max_days,
    'max_months': max_months,
    'num_months': num_months,
    'hidden': hidden,
    'overlap': overlap,
    'max_H': max_H,
    'max_W': max_W,
    'spatial_depth': spatial_depth,
    'spatial_heads': spatial_heads,
}
SarahAlidoost
left a comment
@rogerkuou thanks for addressing the comments 👍 . Here are some more suggestions:
- I see that the example notebook has been changed in this PR. I cannot see exactly what is changed, but since this PR is about testing large data on HPC, let's not change the example notebook.
- No need to add an inference script in this PR; for now we can skip that one. Let's focus on the setup of the training on HPC in this PR. We can add the inference script later when fixing #32.
- After implementing these suggestions and re-running the SLURM job, can you please add the SLURM logfile to the PR as well? Also, can you perhaps give an indication of how many resources were used to complete the job?
If something is not clear, please let me know.
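One way to get that resource indication, assuming SLURM accounting is enabled on Levante, is `sacct` (a sketch; the job ID here is a placeholder, use the ID printed by `sbatch`):

```shell
# sacct reports per-job accounting once the job has finished; MaxRSS gives
# peak memory, Elapsed/TotalCPU give wall-clock and CPU time.
JOBID=123456  # placeholder job ID
CMD="sacct -j ${JOBID} --format=JobID,JobName,Elapsed,MaxRSS,TotalCPU,State"
echo "${CMD}"
```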
fix #25
Did not finish the train-validation-test split in this PR, but made a new issue #28.