We train generative AI models such as FLUX, Qwen Image, and Wan 2.2.
We have noticed a massive speed loss on Windows compared to Linux whenever we do large data transfers between RAM and the GPU.
The hit is so large that Linux runs 2x faster than Windows, or even more.
Tests were run on the same hardware (GPU: RTX 5090).
More details are here: kohya-ss/musubi-tuner#700
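To compare the two operating systems without running a full training job, a raw host-to-device copy can be timed on each one. The following is only a minimal sketch, assuming PyTorch with CUDA support is installed; the function names `bandwidth_gib_s` and `measure_h2d_bandwidth` are made up for this example, and the use of pinned memory and a 1 GiB buffer are my assumptions, not something taken from the linked issue.

```python
import time


def bandwidth_gib_s(num_bytes: int, seconds: float) -> float:
    """Convert a byte count and a duration into GiB per second."""
    return num_bytes / seconds / 2**30


def measure_h2d_bandwidth(size_mib: int = 1024, repeats: int = 10) -> float:
    """Time repeated RAM -> GPU copies and return the average GiB/s.

    Hypothetical helper: needs PyTorch and a CUDA GPU at call time.
    """
    import torch  # imported here so the pure helper above works without it

    if not torch.cuda.is_available():
        raise RuntimeError("a CUDA device is required for this measurement")

    # Page-locked (pinned) host buffer, so the copy can run as an async DMA.
    host = torch.empty(size_mib * 2**20, dtype=torch.uint8, pin_memory=True)
    device = torch.empty_like(host, device="cuda")

    # Warm-up copy so one-time initialization cost is not measured.
    device.copy_(host, non_blocking=True)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(repeats):
        device.copy_(host, non_blocking=True)
    torch.cuda.synchronize()  # wait until all queued copies have finished
    elapsed = time.perf_counter() - start

    return bandwidth_gib_s(host.numel() * repeats, elapsed)
```

Calling `measure_h2d_bandwidth()` on Windows and on Linux on the same machine should give two numbers that can be compared directly.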
It turns out that if we enable TCC mode on Windows, the speed becomes equal to Linux.
However, NVIDIA blocks this at the driver level.
I found a Chinese article showing that patching nvlddmkm.sys, by changing just a few characters, makes TCC mode fully work on consumer GPUs. However, this option is extremely hard and complex for average users.
Now my question is: why can't we get Linux speed on Windows?
Everything I have found says it is due to the WDDM driver mode.
Moreover, it seems Microsoft has added MCDM (Microsoft Compute Driver Model):
https://learn.microsoft.com/en-us/windows-hardware/drivers/display/mcdm-architecture
As far as I understand, MCDM mode should also reach the same speed.
How can we fix this slowness on Windows compared to Linux?
Here is why this affects us. Recent AI models are massive and do not fit into GPU memory, so we use block swapping: only the model blocks that are currently being trained stay on the GPU. This means we constantly swap model blocks between RAM and the GPU.
As you can imagine, this is a massive amount of data transfer. It is extremely fast on Linux, but on the same hardware Windows is at least 3x slower, and we have not been able to solve this issue yet.
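To give a feel for how much data block swapping moves, here is a small pure-Python model of one training step. It is a simplified sketch and not the actual musubi-tuner implementation: the function name `simulate_step`, the eviction rule, and the assumption that every block has the same size are all hypothetical.

```python
def simulate_step(num_blocks: int, blocks_to_swap: int, block_bytes: int) -> int:
    """Return the RAM -> GPU bytes uploaded in one forward + backward pass.

    Simplified model (assumption): the GPU holds `num_blocks - blocks_to_swap`
    blocks at a time, and a block that was already used in the current pass
    is evicted to make room for the next one that is needed.
    """
    if not 0 <= blocks_to_swap < num_blocks:
        raise ValueError("blocks_to_swap must be in [0, num_blocks)")

    capacity = num_blocks - blocks_to_swap
    resident = set(range(capacity))  # blocks on the GPU before the step
    uploaded = 0

    for forward in (True, False):
        # Forward visits blocks 0..N-1, backward visits them in reverse.
        order = range(num_blocks) if forward else range(num_blocks - 1, -1, -1)
        for idx in order:
            if idx in resident:
                continue
            # Evict the block that this pass finished with the longest ago.
            victim = min(resident) if forward else max(resident)
            resident.remove(victim)
            resident.add(idx)
            uploaded += block_bytes

    return uploaded
```

In this model every swapped block is uploaded once in the forward pass and once in the backward pass, so each step moves `2 * blocks_to_swap * block_bytes` bytes from RAM to the GPU. That is why the RAM-to-GPU transfer speed ends up dominating the step time when many blocks are swapped.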