refactor checkpointing by nkern · Pull Request #6 · nkern/cosmo_diffusion

nkern · 2026-05-01T19:43:06Z

Refactor checkpointing to use accelerate.save_state(). The noise_scheduler, lr_scheduler, and optimizer are no longer pickled.

Because a PyTorch state dict stores only an object's state and not its class, we have to "know" the class names before loading, so that each object can be instantiated before its state is restored. This is why we now write a ckpt_config.yaml with this info to the checkpoint directory.
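To illustrate why the class names are needed, here is a minimal, dependency-free sketch of the pattern. It is not the code from this PR: `ToyScheduler`, `REGISTRY`, `save_checkpoint`, and `load_checkpoint` are hypothetical names, JSON files stand in for the state that accelerate would write, and the flat `name: ClassName` layout of ckpt_config.yaml is an assumption (the real file may look different and would likely be written with a YAML library).

```python
import json
import tempfile
from pathlib import Path


class ToyScheduler:
    """Hypothetical stand-in for an object with state_dict()/load_state_dict()."""

    def __init__(self, base_lr=0.1):
        self.base_lr = base_lr
        self.step_count = 0

    def step(self):
        self.step_count += 1

    def state_dict(self):
        # Only values are saved; nothing here records which class produced them.
        return {"base_lr": self.base_lr, "step_count": self.step_count}

    def load_state_dict(self, state):
        self.base_lr = state["base_lr"]
        self.step_count = state["step_count"]


# Hypothetical registry mapping saved class names back to classes.
REGISTRY = {"ToyScheduler": ToyScheduler}


def save_checkpoint(ckpt_dir, components):
    """Write each component's state plus a ckpt_config.yaml naming its class."""
    ckpt_dir = Path(ckpt_dir)
    # Flat "name: ClassName" lines are valid YAML; written by hand in this sketch.
    lines = [f"{name}: {type(obj).__name__}" for name, obj in components.items()]
    (ckpt_dir / "ckpt_config.yaml").write_text("\n".join(lines) + "\n")
    for name, obj in components.items():
        (ckpt_dir / f"{name}.json").write_text(json.dumps(obj.state_dict()))


def load_checkpoint(ckpt_dir):
    """Rebuild components using only the files in the checkpoint directory."""
    ckpt_dir = Path(ckpt_dir)
    components = {}
    for line in (ckpt_dir / "ckpt_config.yaml").read_text().splitlines():
        if not line.strip():
            continue
        name, class_name = (part.strip() for part in line.split(":", 1))
        obj = REGISTRY[class_name]()  # instantiate from the recorded class name
        state = json.loads((ckpt_dir / f"{name}.json").read_text())
        obj.load_state_dict(state)  # then restore the saved state into it
        components[name] = obj
    return components


scheduler = ToyScheduler(base_lr=0.01)
for _ in range(3):
    scheduler.step()

with tempfile.TemporaryDirectory() as ckpt_dir:
    save_checkpoint(ckpt_dir, {"lr_scheduler": scheduler})
    restored = load_checkpoint(ckpt_dir)["lr_scheduler"]
```

Unlike a pickle, the saved state alone cannot recreate the object, so the loader reads the class name from the checkpoint directory and does not need the original config.yaml.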

In principle, we could get this info from the original config.yaml file, but the idea here is to let train() operate on its own, without needing a config.yaml.

Also fixed a bug (resume_from_chkpt epoch number) so that training can be resumed from a checkpoint.

nkern added 2 commits May 1, 2026 12:19

refactor checkpointing

d865b2a

updated resume_from_chkpt epoch # bug fix

cb76137

nkern merged commit fc74b3a into main May 5, 2026
2 checks passed

nkern deleted the refactor_checkpoint branch May 5, 2026 19:53
