PTX Backend #18

Open
WillTrojak wants to merge 11 commits into PyFR:master from WillTrojak:feature/ptx

@WillTrojak
Member

This adds a PTX backend to GiMMiK. The key features are:

  • Mild optimisation of existing CUDA algorithms.
  • Optional async loads for some sparse kernels.
  • Added dense generation for Hopper and above.

Optimisations have focused on FP64; FP32 is future work.

Comment thread gimmik/kernels/ptx/bstream-msplit.mako Outdated
Comment thread gimmik/ptx.py Outdated
Comment thread gimmik/ptx.py Outdated
Comment thread gimmik/ptx.py Outdated
yield (tpl, args, meta)

# Warp-specialised dense DMMA
if cc >= (10, 0):
Contributor

Does this gate consumer cards with less shared memory?

Member Author

Not sure what the best way to handle this is. I've added a DENSE_SMEM_MAX but we could set this via the ini or driver?

Contributor

If consumer cards can pass the check they need to work. Not sure if there is a clear mapping from CC to max smem. Otherwise, have the caller pass in additional info about max shared memory.
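
One way to read the suggestion above is to gate the dense path on both compute capability and a shared-memory limit supplied by the caller. The following is only a sketch under that assumption: `dense_enabled`, `smem_required` and `smem_max` are hypothetical names and are not part of GiMMiK.

```python
def dense_enabled(cc, smem_required, smem_max=None):
    """Decide whether the dense kernel may be generated.

    cc            -- compute capability as a (major, minor) tuple
    smem_required -- bytes of shared memory the kernel would use
    smem_max      -- per-block limit reported by the caller, or None
    """
    # Hypothetical gate mirroring the `cc >= (10, 0)` check under review
    if cc < (10, 0):
        return False

    # Without a limit from the caller we cannot show the kernel fits
    if smem_max is None:
        return False

    return smem_required <= smem_max
```

With this shape, a card that passes the compute capability check but reports too little shared memory simply does not get the dense kernel, rather than failing at launch.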

Comment thread gimmik/ptx.py Outdated
@@ -0,0 +1,276 @@
# -*- coding: utf-8 -*-

import struct
Contributor

PEP8

Comment thread gimmik/ptx.py Outdated
Comment thread gimmik/ptx.py Outdated
Comment thread gimmik/ptx.py Outdated
Comment thread gimmik/ptx.py Outdated
Comment thread gimmik/ptx.py Outdated
@FreddieWitherden
Contributor

I know this is an utter pain, but for FP32/FP64 can you confirm correctness for all relevant PyFR matrices at a suite of N values, for all instances where a kernel is expected to work (on A100/H100/B100)?
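
A correctness sweep of this kind could be driven by a small helper that compares each kernel against a NumPy reference product. This is a rough sketch only: `check_kernel` and the `run_kernel` callable are hypothetical stand-ins for however the generated kernel is launched, and the tolerances are illustrative assumptions rather than values from the PR.

```python
import numpy as np


def check_kernel(run_kernel, A, n, dtype, seed=0):
    """Compare a kernel computing C = A @ B against NumPy.

    run_kernel -- hypothetical callable taking B and returning C
    A          -- constant operator matrix
    n          -- number of columns of B to test with
    dtype      -- np.float32 or np.float64
    """
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((A.shape[1], n)).astype(dtype)

    # Reference result in the same precision as the kernel
    ref = A.astype(dtype) @ B
    out = run_kernel(B)

    # Illustrative tolerances; looser for single precision
    tol = 1e-5 if np.dtype(dtype) == np.float32 else 1e-12
    return bool(np.allclose(out, ref, rtol=tol, atol=tol))
```

Looping this over every matrix, each value of `n` and both precisions would give a pass/fail table per device.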

Comment thread gimmik/kernels/ptx/base.mako Outdated
.param .u64 _c)
{
% endif
.reg .u32 n, id, tid_x, tid_y;
Contributor

Ensure we throw higher up if n is too big.

Contributor

Checking here

Member Author

We don't handle n being too large in any of the other backends.

Contributor

In the embedded case we do: https://github.com/PyFR/GiMMiK/blob/master/gimmik/kernels/cuda/cstream.mako#L20 (the argument case doesn't, but that is not currently used for CUDA).
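
The request to throw higher up could take the form of a host-side guard that runs before any kernel is rendered. This is a minimal sketch, assuming the limit comes from the 32-bit `.reg .u32 n` declaration quoted above; `check_n` is a hypothetical helper, not an existing GiMMiK function.

```python
U32_MAX = 2**32 - 1


def check_n(n):
    """Raise if the column count cannot be held in a u32 register."""
    if n < 0:
        raise ValueError(f'n={n} must be non-negative')

    # Assumption: n is stored and compared as an unsigned 32-bit value
    if n > U32_MAX:
        raise ValueError(f'n={n} exceeds the u32 range')

    return n
```

A stricter bound may be needed if byte offsets derived from `n` are also computed in 32 bits; that depends on the kernel and is not covered here.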

Comment thread gimmik/kernels/ptx/bstream-msplit.mako Outdated
Comment thread gimmik/kernels/ptx/bstream-msplit.mako Outdated
Comment thread gimmik/kernels/ptx/bstream-msplit.mako Outdated
% if afix[row_j] == -1:
% if beta == 0:
{
.reg .${pftype} _tmp;
Contributor

Can this be factored up, as it appears in both branches?

Comment thread gimmik/kernels/ptx/cstream-ksplit.mako Outdated
Comment thread gimmik/kernels/ptx/bstream.mako
Comment thread gimmik/ptx.py Outdated
Comment thread gimmik/ptx.py Outdated
Comment thread gimmik/ptx.py Outdated
Comment thread gimmik/ptx.py Outdated
Comment thread gimmik/cuda.py
nnz = np.count_nonzero(arr)
nuq = len(np.unique(np.abs(arr)))
density = nnz / arr.size
return (nuq <= 28) or (density <= 0.15)
Contributor

Check if these could do with tuning
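
To make tuning easier, the two constants in the heuristic quoted above could be exposed as parameters. The sketch below is an assumption about how that might look: `passes_heuristic`, `max_unique` and `max_density` are hypothetical names, and the defaults are simply the values shown in the diff.

```python
import numpy as np


def passes_heuristic(arr, max_unique=28, max_density=0.15):
    """Heuristic from the diff with its thresholds exposed."""
    nnz = np.count_nonzero(arr)

    # Number of distinct magnitudes, counting zero if present
    nuq = len(np.unique(np.abs(arr)))

    # Fraction of entries that are non-zero
    density = nnz / arr.size

    return bool(nuq <= max_unique or density <= max_density)
```

A tuning run could then sweep `max_unique` and `max_density` over the PyFR matrices and compare kernel timings for each setting.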

Comment thread gimmik/ptx.py Outdated
Comment thread gimmik/ptx.py Outdated
Comment thread gimmik/ptx.py Outdated
Comment thread gimmik/ptx.py Outdated
Comment thread gimmik/kernels/ptx/bstream-msplit.mako
% for idx, kx in enumerate(bchunks[bb]):
ld.shared.${pftype} bv, [bsub_thread + ${bsub_off(buf_cur, idx)}];
% for j, row_j in enumerate(mcx):
<% jx = A[row_j, kx] %>
Contributor

See if NumPy indexing can be used in the for loop, e.g. A[mcx, kx].
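
For illustration, NumPy integer-array indexing returns all entries of a column for a list of rows in one call, which is what the comment above appears to suggest. The matrix and index values here are toy assumptions, not data from the template.

```python
import numpy as np

# Toy operator matrix standing in for the constant matrix A
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0],
              [4.0, 0.0, 5.0]])

mcx = [0, 2]  # hypothetical chunk of row indices
kx = 2        # hypothetical column index

# Per-element lookup, as in the template's inner loop
loop_vals = [A[row_j, kx] for row_j in mcx]

# One indexing call yielding the same entries in the same order
vec_vals = A[mcx, kx]

assert np.array_equal(loop_vals, vec_vals)
```

Inside the Mako template the loop would then iterate over `enumerate(A[mcx, kx])` or similar, instead of indexing `A` once per row.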

.reg .pred pm_${mt};
{
.reg .u32 crow;
add.u32 crow, r_div4, ${mt * 8};
Contributor

Try to put the constant first, so 8*mt.

2 participants