Open
Labels
bug (Something isn't working), followup (Failure to provide config or other info or needs followup)
Description
What happened?
After updating to the latest commit I got an OOM while trying to train a LoRA for Qwen Image (bf16) with 1.0 CPU offload. During caching the model loaded as usual, with 80 GB of RAM used. Then, once the actual training steps started, RAM usage kept climbing until OneTrainer crashed. I tried manually reverting to the commit that worked before, but got the OOM again. After downgrading Torch to 2.7.1 the issue disappeared.
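For anyone trying to reproduce this, a minimal sketch for checking whether the Torch build in the active venv is newer than the 2.7.1 that worked for me. The helper names (`release_tuple`, `is_newer_than_known_good`, `installed_torch_version`) are made up for illustration and are not part of OneTrainer or Torch; the version parsing is deliberately simple and only looks at the leading numeric components.

```python
from importlib import metadata
from typing import Optional, Tuple

# Version reported as working in the description above.
KNOWN_GOOD = "2.7.1"


def release_tuple(version: str) -> Tuple[int, ...]:
    """Turn a string like '2.8.0+cu128' into (2, 8, 0).

    Local build tags (after '+') and non-numeric suffixes are ignored.
    """
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev2025...'
        parts.append(int(digits))
    return tuple(parts)


def is_newer_than_known_good(installed: str, known_good: str = KNOWN_GOOD) -> bool:
    """True if the installed release is strictly newer than the known-good one."""
    return release_tuple(installed) > release_tuple(known_good)


def installed_torch_version() -> Optional[str]:
    """Return the installed torch version string, or None if torch is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_torch_version()
    if found is None:
        print("torch is not installed in this environment")
    else:
        newer = is_newer_than_known_good(found)
        print(f"torch {found}; newer than {KNOWN_GOOD}: {newer}")
```

Run it with the venv's Python so it reports the Torch that OneTrainer actually uses. It only reports the version; it does not say anything about why RAM usage grows.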
What did you expect would happen?
No OOM with the same config that worked before.
Relevant log output
Starting UI...
C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\default.py:30: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
Fetching 16 files: 100%|█████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 1173.66it/s]
TensorFlow installation not found - running with reduced feature set.
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.19.0 at http://localhost:6006/ (Press CTRL+C to quit)
The config attributes {'pooled_projection_dim': 768} were passed to QwenImageTransformer2DModel, but are not expected and will be ignored. Please verify your config.json configuration file.
Selected layers: 720
Deselected layers: 126
Note: Enable Debug mode to see the full list of layer names
Exception in thread Reloader:
Traceback (most recent call last):
File "C:\Users\ulexe\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 1045, in _bootstrap_inner
self.run()
File "C:\Users\ulexe\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\data_ingester.py", line 108, in _reload
self._multiplexer.Reload()
File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\plugin_event_multiplexer.py", line 263, in Reload
Worker()
File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\plugin_event_multiplexer.py", line 241, in Worker
accumulator.Reload()
File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\plugin_event_accumulator.py", line 202, in Reload
for event in self._generator.Load():
File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\directory_watcher.py", line 88, in Load
for event in self._LoadInternal():
File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\directory_watcher.py", line 118, in _LoadInternal
for event in self._loader.Load():
File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 270, in Load
for event in super().Load():
File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 244, in Load
for record in super().Load():
File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 178, in Load
yield next(self._iterator)
^^^^^^^^^^^^^^^^^^^^
File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 109, in __next__
self._reader.GetNext()
File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\compat\tensorflow_stub\pywrap_tensorflow.py", line 207, in GetNext
header_str = self._read(8)
^^^^^^^^^^^^^
File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\compat\tensorflow_stub\pywrap_tensorflow.py", line 273, in _read
new_data = self.file_handle.read(n)
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\compat\tensorflow_stub\io\gfile.py", line 736, in read
(self.buff, self.continuation_token) = self.fs.read(
^^^^^^^^^^^^^
File "C:\AI\OneTrainer\OneTrainer\venv\Lib\site-packages\tensorboard\compat\tensorflow_stub\io\gfile.py", line 141, in read
data = f.read(size)
^^^^^^^^^^^^
MemoryError
Generate and upload debug_report.log
No response