PTX Backend #18
yield (tpl, args, meta)

# Warp-specialised dense DMMA
if cc >= (10, 0):
Does this gate consumer cards with less shared memory?
Not sure what the best way to handle this is. I've added a DENSE_SMEM_MAX constant, but could we set this via the ini file or query it from the driver?
If consumer cards can pass the check they need to work. Not sure if there is a clear mapping from CC to max smem. Otherwise, have the caller pass in additional info about max shared memory.
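The suggestion of having the caller pass in the shared memory limit could look roughly like the sketch below. All names here (dense_kernel_allowed, smem_required, smem_max) are hypothetical and not part of GiMMiK; returning False when the limit is unknown is an assumed, conservative policy rather than something agreed in this thread.

```python
def dense_kernel_allowed(cc, smem_required, smem_max=None):
    """Decide whether the warp-specialised dense kernel may be emitted.

    cc            -- compute capability as a (major, minor) tuple
    smem_required -- bytes of shared memory the kernel would request
    smem_max      -- per-block shared memory limit supplied by the caller,
                     or None if unknown (hypothetical parameter)
    """
    # Same compute-capability gate as in the diff above
    if cc < (10, 0):
        return False

    # Assumed policy: without a known limit we cannot show the kernel fits
    if smem_max is None:
        return False

    # Only emit the kernel when its request fits on this device
    return smem_required <= smem_max
```

With this shape, a consumer card that passes the compute-capability test but reports a smaller limit simply does not get the dense kernel, instead of failing at launch.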
@@ -0,0 +1,276 @@
# -*- coding: utf-8 -*-

import struct
I know this is an utter pain, but for FP32/FP64 can you confirm correctness for all relevant PyFR matrices at a suite of N values, for all instances where a kernel is expected to work (on A100/H100/B100)?
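A sweep of this kind can be scripted against a NumPy reference. The sketch below is only an outline: check_kernel, the kernel callable, the mats dictionary and the tolerances are all assumptions made for illustration, and the callable stands in for whatever actually compiles and launches a generated kernel.

```python
import numpy as np


def check_kernel(kernel, mats, n_values, dtype=np.float64, tol=None, seed=0):
    """Compare kernel(A, B) against A @ B for each matrix and each N.

    kernel   -- callable taking (A, B) and returning the product (stand-in)
    mats     -- dict mapping a label to an operator matrix
    n_values -- iterable of column counts N to test
    Returns a list of (label, N) pairs that failed the comparison.
    """
    rng = np.random.default_rng(seed)
    if tol is None:
        # Assumed tolerances; tune these for the precision under test
        tol = 1e-5 if np.dtype(dtype) == np.float32 else 1e-12

    failures = []
    for name, A in mats.items():
        A = np.asarray(A, dtype=dtype)
        for n in n_values:
            B = rng.standard_normal((A.shape[1], n)).astype(dtype)
            ref = A @ B
            out = kernel(A, B)
            if not np.allclose(out, ref, rtol=tol, atol=tol):
                failures.append((name, n))
    return failures
```

An empty return value means every (matrix, N) combination matched the reference; the list of failures otherwise gives a direct set of cases to rerun.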
.param .u64 _c)
{
% endif
.reg .u32 n, id, tid_x, tid_y;
Ensure we throw higher up if n is too big.
We don't handle n being too large in any of the other backends.
In the embedded case we do (see https://github.com/PyFR/GiMMiK/blob/master/gimmik/kernels/cuda/cstream.mako#L20); the argument case doesn't, but that is not currently used for CUDA.
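Throwing higher up could be done with a range check before the template is rendered. The sketch below assumes that "too big" means the largest element index no longer fits in the 32-bit registers declared in the kernel; the actual bound is not stated in this thread, and check_n_fits and its parameters are hypothetical names.

```python
U32_MAX = 2**32 - 1


def check_n_fits(n, ldb, ldc, b_rows, c_rows, max_index=U32_MAX):
    """Raise ValueError if any element index would exceed max_index.

    n      -- number of columns handled by the kernel
    ldb    -- leading dimension of B (in elements)
    ldc    -- leading dimension of C (in elements)
    b_rows -- number of rows of B that are read
    c_rows -- number of rows of C that are written
    """
    if n <= 0:
        raise ValueError(f'n must be positive, got {n}')

    # Largest row-major element index touched in either array
    largest = max((b_rows - 1) * ldb, (c_rows - 1) * ldc) + (n - 1)

    if largest > max_index:
        raise ValueError(
            f'n={n} gives a maximum index of {largest}, '
            f'which exceeds {max_index}'
        )
    return largest
```

Calling this where n is embedded into the kernel source means the error surfaces at generation time rather than as silent wrap-around on the device.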
% if afix[row_j] == -1:
% if beta == 0:
{
.reg .${pftype} _tmp;
Can this be factored out, as it appears in both branches?
nnz = np.count_nonzero(arr)
nuq = len(np.unique(np.abs(arr)))
density = nnz / arr.size
return (nuq <= 28) or (density <= 0.15)
Check whether these thresholds could do with tuning.
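To make the thresholds easier to tune, the two constants could be lifted into keyword arguments. This is a sketch only: the function name passes_heuristic and the parameter names are made up for illustration, and the defaults are simply the values from the diff above.

```python
import numpy as np


def passes_heuristic(arr, max_unique=28, max_density=0.15):
    """Return True if arr has few distinct magnitudes or is sparse enough.

    max_unique  -- upper bound on the number of distinct |values|
    max_density -- upper bound on the fraction of non-zero entries
    """
    arr = np.asarray(arr)
    nnz = np.count_nonzero(arr)
    # Zero counts as one of the unique magnitudes, as in the original
    nuq = len(np.unique(np.abs(arr)))
    density = nnz / arr.size
    return bool((nuq <= max_unique) or (density <= max_density))
```

A tuning run can then sweep max_unique and max_density over a grid and record which kernel variant wins for each operator matrix.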
% for idx, kx in enumerate(bchunks[bb]):
ld.shared.${pftype} bv, [bsub_thread + ${bsub_off(buf_cur, idx)}];
% for j, row_j in enumerate(mcx):
<% jx = A[row_j, kx] %>
See if NumPy indexing can be used in the for loop, e.g. A[mcx, kx].
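For reference, indexing with a list of rows and a scalar column returns the same values as the per-row lookup. The values of A, mcx and kx below are illustrative only, and this assumes mcx is a list or integer array of row indices.

```python
import numpy as np

# Small stand-in for the operator matrix
A = np.arange(12, dtype=np.float64).reshape(4, 3)

mcx = [0, 2, 3]  # rows handled in this chunk (illustrative)
kx = 1           # column index (illustrative)

# Per-row lookup, as in the template's inner loop
loop_vals = [A[row_j, kx] for row_j in mcx]

# Single fancy-indexing expression selecting the same entries
vec_vals = A[mcx, kx]
```

In the template this would let the inner loop iterate over enumerate(A[mcx, kx]) directly, dropping the separate jx assignment.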
.reg .pred pm_${mt};
{
.reg .u32 crow;
add.u32 crow, r_div4, ${mt * 8};
Try to put the constant first, so 8*mt.
This adds a PTX backend to GiMMiK. The key features are:
Optimisations have focused on FP64; FP32 is future work.