Improve GPU utilization

- [x] Pass `fused=True` to Adam construction
- [x] Pass `foreach=True` to `clip_grad_norm_(...)` call
- [x] Do `zero_grad(set_to_none=True)` in training
- [x] Can we compute `local_voxel_count` and `global_total_voxels` outside the batch loop to save an all-reduce?
- [ ] `torch.compile()` the model
- [ ] `torch.compile()` the unscale->clip->optimizer step->update block
- [ ] Can we reduce per-batch `.item()` calls for things like gradient logging?
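
One possible approach to the `.item()` question, as a rough sketch only: keep the metric on the device, accumulate it there, and synchronize with the host once per logging window instead of once per batch. The names `log_every`, `running`, and the stand-in `grad_norm` values are hypothetical and not taken from the training code.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

log_every = 4  # hypothetical logging interval, in batches
running = torch.zeros((), device=device)  # on-device accumulator
logged = []

for step in range(1, 9):
    # Stand-in for a per-batch metric such as the gradient norm.
    grad_norm = torch.tensor(float(step), device=device)

    # Accumulating on the device does not force a host sync.
    running += grad_norm.detach()

    if step % log_every == 0:
        # A single .item() call (one sync) per logging window.
        logged.append((running / log_every).item())
        running.zero_()
```

The trade-off is that logged values become window averages rather than per-batch values, which may or may not be acceptable for gradient logging.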


### Context
Pass `fused=True` to Adam construction:  
- Fuses the entire Adam update into a single kernel per parameter group, instead of a sequence of separate elementwise kernels. This should reduce kernel launch overhead.
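
A minimal sketch of the fused Adam change, assuming a stand-in `nn.Linear` model (the real model is not shown in this issue). The fused path is only requested when the parameters live on a CUDA device, because CPU support for `fused=True` depends on the PyTorch version.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the real model.
model = nn.Linear(16, 4).to(device)

# Request the fused implementation only for CUDA parameters.
use_fused = device.type == "cuda"
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, fused=use_fused)

loss = model(torch.randn(8, 16, device=device)).sum()
loss.backward()
optimizer.step()
```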

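Pass `foreach=True` to `clip_grad_norm_(...)` call:
- The per-parameter path of `clip_grad_norm_` iterates over parameters in Python, launching one norm kernel per tensor. With `foreach=True`, torch uses `torch._foreach_norm` to batch the gradient norms into multi-tensor kernels. This collapses the dozens-to-hundreds of individual norm kernels plus a Python reduction loop into a small number of launches. Note that in recent PyTorch releases the default `foreach=None` already selects the foreach path for native CUDA and CPU tensors; passing `foreach=True` makes the choice explicit and raises an error instead of silently falling back to the slow path.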
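
A minimal sketch of the clipping call, again with a hypothetical stand-in model and an arbitrary `max_norm` of 1.0. `foreach=True` is only forced on CUDA here; elsewhere the sketch leaves the default `None` so PyTorch picks the implementation itself.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(16, 4).to(device)  # hypothetical stand-in model

# Scale the loss so the gradients are large enough to be clipped.
loss = model(torch.randn(8, 16, device=device)).sum() * 100.0
loss.backward()

max_norm = 1.0  # arbitrary value for illustration
total_norm = torch.nn.utils.clip_grad_norm_(
    model.parameters(),
    max_norm=max_norm,
    foreach=True if device.type == "cuda" else None,
)

# Recompute the global gradient norm after clipping.
clipped_norm = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()])
)
```

`total_norm` is the norm before clipping and is returned as a tensor, so logging it with `.item()` every batch adds a host sync of its own.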

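Do `zero_grad(set_to_none=True)` in training:
- When `set_to_none=False`, torch runs one `memset` kernel per parameter to fill gradient tensors with zeros. With `set_to_none=True`, it simply drops the `.grad` reference, so no kernel is launched and the next backward pass writes fresh gradients instead of accumulating into zeros. Since PyTorch 2.0, `set_to_none=True` is the default, so the explicit argument mainly matters on older versions.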
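
A small sketch of the behavioural difference, using a hypothetical stand-in model and plain SGD for brevity. The main thing to watch for is that `.grad` is `None` after the call, so any code that inspects gradients between steps has to tolerate that.

```python
import torch
from torch import nn

model = nn.Linear(16, 4)  # hypothetical stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model(torch.randn(8, 16)).sum().backward()
optimizer.step()

# Drops the .grad references instead of filling them with zeros.
optimizer.zero_grad(set_to_none=True)
grads_are_none = all(p.grad is None for p in model.parameters())

# Gradient logging between steps must skip parameters without a grad.
grad_norms = [p.grad.norm() for p in model.parameters() if p.grad is not None]
```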



Improve GPU utilization #62
