[Bug]: RAM usage doubles after upgrading to Torch 2.8 #1125

### What happened?

After updating to the latest commit, I got an OOM while training a LoRA for Qwen Image (bf16) with 1.0 CPU offload. During caching, the model loaded as usual, with 80 GB of RAM used. Then, once the actual training steps started, RAM usage climbed until OneTrainer crashed. I tried manually reverting to the commit that had worked before, but got the OOM again. After downgrading Torch to 2.7.1, the issue disappeared.
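
To confirm which Torch build the venv actually picked up before and after the downgrade, a minimal check along these lines could be run with the venv's Python. The helper names and the 2.8 threshold are illustrative assumptions drawn only from this report, not a confirmed root cause.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '2.8.0+cu128'."""
    core = version.split("+", 1)[0]  # drop a local build tag like '+cu128'
    major, minor = core.split(".")[:2]
    return int(major), int(minor)


def is_suspect_torch(version: str) -> bool:
    """Assumption from this report: 2.8 and later show the higher RAM usage."""
    return parse_major_minor(version) >= (2, 8)


def installed_torch_version() -> Optional[str]:
    """Read the installed torch version from package metadata, without importing torch."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_torch_version()
    if found is None:
        print("torch is not installed in this environment")
    else:
        print(f"torch {found}, suspect per this report: {is_suspect_torch(found)}")
```

This reads only package metadata, so it does not load the model or allocate any memory itself.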

### What did you expect would happen?

No OOM with the same config that worked before.

### Relevant log output

```shell
Starting UI...
C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\default.py:30: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
Fetching 16 files: 100%|█████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 1173.66it/s]
TensorFlow installation not found - running with reduced feature set.
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.19.0 at http://localhost:6006/ (Press CTRL+C to quit)
The config attributes {'pooled_projection_dim': 768} were passed to QwenImageTransformer2DModel, but are not expected and will be ignored. Please verify your config.json configuration file.
Selected layers: 720
Deselected layers: 126
Note: Enable Debug mode to see the full list of layer names
Exception in thread Reloader:
Traceback (most recent call last):
  File "C:\Users\ulexe\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "C:\Users\ulexe\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\data_ingester.py", line 108, in _reload
    self._multiplexer.Reload()
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\plugin_event_multiplexer.py", line 263, in Reload
    Worker()
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\plugin_event_multiplexer.py", line 241, in Worker
    accumulator.Reload()
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\plugin_event_accumulator.py", line 202, in Reload
    for event in self._generator.Load():
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\directory_watcher.py", line 88, in Load
    for event in self._LoadInternal():
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\directory_watcher.py", line 118, in _LoadInternal
    for event in self._loader.Load():
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 270, in Load
    for event in super().Load():
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 244, in Load
    for record in super().Load():
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 178, in Load
    yield next(self._iterator)
          ^^^^^^^^^^^^^^^^^^^^
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 109, in __next__
    self._reader.GetNext()
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\compat\tensorflow_stub\pywrap_tensorflow.py", line 207, in GetNext
    header_str = self._read(8)
                 ^^^^^^^^^^^^^
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\compat\tensorflow_stub\pywrap_tensorflow.py", line 273, in _read
    new_data = self.file_handle.read(n)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\compat\tensorflow_stub\io\gfile.py", line 736, in read
    (self.buff, self.continuation_token) = self.fs.read(
                                           ^^^^^^^^^^^^^
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\compat\tensorflow_stub\io\gfile.py", line 141, in read
    data = f.read(size)
           ^^^^^^^^^^^^
MemoryError
```

### Generate and upload debug_report.log

_No response_
