[Bug]: Double the RAM usage after upgrading to Torch 2.8 #1125

@Ulexer

What happened?

After updating to the latest commit, I got an out-of-memory (OOM) error while trying to train a LoRA for Qwen Image in bf16 with CPU offload set to 1.0. During caching, the model loaded as usual, with 80 GB of RAM used. Then, once the actual training steps started, RAM usage kept climbing until OneTrainer crashed. I tried manually reverting to the commit that had worked before, but got OOM again. After downgrading Torch to 2.7.1, the issue disappeared.
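
For anyone trying to reproduce the comparison between Torch 2.8 and 2.7.1, a minimal sketch like the one below could record the installed torch version together with the process's resident memory at a few points (for example after caching and after the first training step). It assumes the third-party `psutil` package is available in the venv; the helper names (`installed_version`, `rss_gib`, `memory_report`) are hypothetical and not part of OneTrainer.

```python
import importlib.metadata
import os
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None


def rss_gib() -> Optional[float]:
    """Resident memory of the current process in GiB, or None without psutil."""
    try:
        import psutil  # third-party; assumed to be installed
    except ImportError:
        return None
    return psutil.Process(os.getpid()).memory_info().rss / 1024**3


def memory_report(tag: str) -> str:
    """Build a one-line report: tag, torch version, and current process RAM."""
    version = installed_version("torch") or "not installed"
    rss = rss_gib()
    rss_text = f"{rss:.2f} GiB" if rss is not None else "unknown (psutil missing)"
    return f"[{tag}] torch={version} rss={rss_text}"


if __name__ == "__main__":
    # Hypothetical checkpoints; call these from wherever the stages actually run.
    print(memory_report("after caching"))
    print(memory_report("after first step"))
```

Running the same config under each Torch version and comparing the reported lines would show at which stage the RAM usage starts to differ.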

What did you expect would happen?

No OOM with the same config that worked before.

Relevant log output

Starting UI...
C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\default.py:30: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
Fetching 16 files: 100%|█████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 1173.66it/s]
TensorFlow installation not found - running with reduced feature set.
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.19.0 at http://localhost:6006/ (Press CTRL+C to quit)
The config attributes {'pooled_projection_dim': 768} were passed to QwenImageTransformer2DModel, but are not expected and will be ignored. Please verify your config.json configuration file.
Selected layers: 720
Deselected layers: 126
Note: Enable Debug mode to see the full list of layer names
Exception in thread Reloader:
Traceback (most recent call last):
  File "C:\Users\ulexe\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "C:\Users\ulexe\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\data_ingester.py", line 108, in _reload
    self._multiplexer.Reload()
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\plugin_event_multiplexer.py", line 263, in Reload
    Worker()
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\plugin_event_multiplexer.py", line 241, in Worker
    accumulator.Reload()
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\plugin_event_accumulator.py", line 202, in Reload
    for event in self._generator.Load():
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\directory_watcher.py", line 88, in Load
    for event in self._LoadInternal():
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\directory_watcher.py", line 118, in _LoadInternal
    for event in self._loader.Load():
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 270, in Load
    for event in super().Load():
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 244, in Load
    for record in super().Load():
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 178, in Load
    yield next(self._iterator)
          ^^^^^^^^^^^^^^^^^^^^
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 109, in __next__
    self._reader.GetNext()
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\compat\tensorflow_stub\pywrap_tensorflow.py", line 207, in GetNext
    header_str = self._read(8)
                 ^^^^^^^^^^^^^
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\compat\tensorflow_stub\pywrap_tensorflow.py", line 273, in _read
    new_data = self.file_handle.read(n)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\compat\tensorflow_stub\io\gfile.py", line 736, in read
    (self.buff, self.continuation_token) = self.fs.read(
                                           ^^^^^^^^^^^^^
  File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\compat\tensorflow_stub\io\gfile.py", line 141, in read
    data = f.read(size)
           ^^^^^^^^^^^^
MemoryError

Generate and upload debug_report.log

No response

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working), followup (Failure to provide config or other info or needs followup)

