Description
Feature request
Looking through the codebase, I noticed that kernels such as kgemm_4bit_inference_naive perform integer division by block_size on the GPU, where block_size is a runtime argument rather than a template argument. The Python front end requires blocksize to be a power of 2, but that constraint isn't communicated to the kernel. Integer division by a runtime value that the compiler cannot reduce to a bitshift performs poorly on GPUs. I propose rewriting these kernels so that the integer divisions can be replaced with bitshifts.
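To make the idea concrete, here is a minimal sketch in plain C++ rather than the actual kernel code. All names (is_power_of_two, log2_pow2, block_index_div, block_index_shift) are hypothetical and do not exist in the codebase. It assumes the host would compute log2 of the blocksize once and pass that to the kernel in place of (or alongside) the blocksize. In CUDA, the two indexing helpers would presumably be marked `__device__`.

```cpp
#include <cassert>
#include <cstdint>

// True when n is a nonzero power of two (exactly one bit set).
constexpr bool is_power_of_two(uint32_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// Hypothetical host-side helper: returns log2(blocksize).
// Precondition: blocksize is a power of two, as the Python front end requires.
inline uint32_t log2_pow2(uint32_t blocksize) {
    assert(is_power_of_two(blocksize));
    uint32_t shift = 0;
    while ((blocksize >> shift) != 1u) {
        ++shift;
    }
    return shift;
}

// Pattern described in this issue: the divisor is only known at runtime,
// so the compiler has to emit a general integer division.
inline uint32_t block_index_div(uint32_t i, uint32_t blocksize) {
    return i / blocksize;
}

// Proposed pattern: the caller passes log2(blocksize), so the division
// becomes a single right shift. For unsigned i this gives the same result.
inline uint32_t block_index_shift(uint32_t i, uint32_t blocksize_log2) {
    return i >> blocksize_log2;
}
```

For unsigned indices and a power-of-two blocksize, block_index_shift(i, log2_pow2(blocksize)) should match block_index_div(i, blocksize) for every i. An alternative would be to make the blocksize a template argument so the compiler can do the reduction itself, at the cost of one kernel instantiation per supported blocksize.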
Motivation
Integer division is slow in general, but division by a power of two can be reduced to a bitshift. The kernels can't take advantage of this because the power-of-two constraint is enforced only on the Python front end.
Your contribution
I'd be happy to submit a PR for this, unless there is a deeper reason why the kernels are written this way.