Communicate blocksize constraints to kernels that take blocksize as a runtime argument #1317

@mm04926412

Description

Feature request

Looking through the code base, I've noticed that kernels such as kgemm_4bit_inference_naive perform integer division by block_size on the GPU, where block_size is a runtime argument rather than a template argument. The Python front end requires blocksize to be a power of 2, but that constraint isn't communicated to the kernel. Integer division that the compiler cannot simplify to a bit shift performs poorly on GPUs. I propose rewriting these kernels so that they can replace the integer divisions with bit shifts.
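
To illustrate the idea, here is a minimal sketch in plain C++ so that it compiles on the host; the same expressions should carry over to device code. All helper names are hypothetical and not taken from the code base. It assumes the host computes the shift amount once from the power-of-two blocksize and passes it to the kernel in place of (or alongside) the blocksize.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when x is a nonzero power of two.
constexpr bool is_pow2(uint32_t x) { return x != 0 && (x & (x - 1)) == 0; }

// Hypothetical helper: log2 of a power-of-two blocksize.
// Intended to run once on the host before the kernel launch.
inline uint32_t log2_pow2(uint32_t blocksize) {
    assert(is_pow2(blocksize));
    uint32_t shift = 0;
    while ((1u << shift) < blocksize) ++shift;
    return shift;
}

// Generic form: blocksize is only known at run time, so the compiler
// has to emit a real integer division.
inline uint32_t block_index_div(uint32_t i, uint32_t blocksize) {
    return i / blocksize;
}

// Shift form: valid only when blocksize == 1u << shift.
inline uint32_t block_index_shift(uint32_t i, uint32_t shift) {
    return i >> shift;
}

// The matching remainder (i % blocksize) becomes a bit mask.
inline uint32_t block_offset_mask(uint32_t i, uint32_t shift) {
    return i & ((1u << shift) - 1u);
}
```

The indices are unsigned on purpose: for signed values, a right shift and a division round differently on negative inputs, so the substitution is only a drop-in replacement for non-negative indices.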

Motivation

Integer division is slow in general, but division by a power of two reduces to a bit shift. The kernels can't make that substitution because the power-of-two constraint is only enforced in the Python front end.

Your contribution

I'd be happy to submit a PR to resolve this if there isn't some deeper reason why things are written this way.

Metadata

Labels

Enhancement (New feature or request), Low Risk (Risk of bugs in transformers and other libraries)
