CUDA performance regression due to DefaultMaxRegistersPerThread #5
Replies: 2 comments
-
Thanks for reporting this. I'll look into it asap.
-
Fixed in 4.9.6 (commit 2ec94d6, currently Root cause was that
Thanks for the report. Including the before/after register counts made the root cause immediately clear.
-
There is a severe performance regression in the CUDA backend, introduced by DefaultMaxRegistersPerThread:
public static int DefaultMaxRegistersPerThread { get; set; } = 255;
In one of my kernels, this default raises register usage to 94, compared with 42 when compiled against the old version (4.7.2). When I set the property to 0, I get kernels identical to those from 4.7.2.
This probably needs some rethinking, because the default is a performance killer.
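To illustrate why the jump from 42 to 94 registers can hurt so much, here is a rough sketch of the register-limited occupancy arithmetic. The hardware limits (65,536 registers and 2,048 resident threads per SM) and the block size of 256 are assumptions chosen for illustration, not values taken from this report. The class and method names are hypothetical and not part of the library. The estimate also ignores register allocation granularity, shared memory and per-SM block limits, so treat it as an upper bound only.

```csharp
using System;

static class OccupancySketch
{
    // Assumed hardware limits; real values depend on the GPU architecture.
    const int RegistersPerSm = 65536;
    const int MaxThreadsPerSm = 2048;

    // Upper bound on resident threads per SM imposed by register usage alone.
    // Ignores allocation granularity, shared memory and per-SM block limits.
    public static int ResidentThreads(int registersPerThread, int blockSize)
    {
        int blocksByRegisters = RegistersPerSm / (registersPerThread * blockSize);
        int blocksByThreads = MaxThreadsPerSm / blockSize;
        return Math.Min(blocksByRegisters, blocksByThreads) * blockSize;
    }

    static void Main()
    {
        const int blockSize = 256; // hypothetical launch configuration
        foreach (int regs in new[] { 42, 94 })
        {
            int threads = ResidentThreads(regs, blockSize);
            Console.WriteLine($"{regs} registers -> {threads} resident threads " +
                              $"({100.0 * threads / MaxThreadsPerSm:F0}% occupancy)");
        }
    }
}
```

Under these assumptions, 42 registers allow 6 resident blocks (1536 threads, 75% occupancy), while 94 registers allow only 2 (512 threads, 25%). Whether that translates into a slowdown depends on the kernel, but it is consistent with the regression described above.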