Disclaimer: Work in progress. Community input is appreciated to refine the feature details and requirements.
Overview
We want a cuco::queue data structure implementing producer-consumer data exchange inspired by an
internal prototype that supports GPU producers and flexible consumers.
The prototype implementation provides a multiple producer single consumer (MPSC) queue with a
notifier abstraction ("doorbell") that batches submissions efficiently on the producer side and
can be adapted to different notification mechanisms.
The core ideas are:
- A circular queue layout as a linear array in memory
- A queue control block with atomic head and tail
- Device-side push operations that reserve slots and write entries
- A notifier (doorbell) that aggregates submissions so consumers can observe progress efficiently
The doorbell concept is a small state machine that tracks the next visible position and allows
producers to publish contiguous ranges while consumers poll or wait on progress updates.
This concept can be used to optimize for different message passing scenarios listed below.
Potential message passing scenarios:
- GPU to CPU (already implemented in the prototype)
- GPU to GPU (NVLink or network)
- CTA to CTA (via global memory)
- Warp to warp (via shared memory for warp specialized workloads)
Goals
- Provide a cuco queue API with owning and non-owning reference types.
- Provide MPSC as the first supported mode with a clear path to MPMC.
- Provide doorbell/notifier policies that batch submissions efficiently for different message passing scenarios.
- Support device-side push and pop plus host-side bulk or polling operations.
- Keep the storage layout and control layout explicit and allocator-driven.
Proposed API surface
```cpp
namespace cuco {

enum class queue_mode { mpsc, mpmc };

struct queue_policy {
  static constexpr queue_mode mode          = queue_mode::mpsc;
  static constexpr cuda::thread_scope scope = cuda::thread_scope_device;
  using notifier = cuco::queue_notifier::thread;
};

template <class T,
          class Policy    = cuco::queue_policy,
          class Allocator = cuco::cuda_allocator<cuda::std::byte>>
class queue;

template <class T,
          class Policy = cuco::queue_policy>
class queue_ref;

}  // namespace cuco
```
Queue and storage construction and refs
- `queue_storage<T, Policy> storage(...)`: owning storage class
- `queue_storage_ref<T, Policy> storage_ref(...)`: non-owning storage class
- `queue(size_type capacity, cuda_thread_scope<Policy::scope> = {}, Policy const& = {}, Allocator = {}, cuda::stream_ref = {})`: queue constructor
- `queue_ref<T, Policy> queue_ref(queue_storage_ref<T, Policy> const& storage)`: non-owning queue reference
Device-side operations (queue_ref)
- `__device__ bool try_push(T const& value)`
- `template <class CG> __device__ bool try_push(CG group, T const& value)`: group-cooperative push (cooperative slot reservation)
- `__device__ bool try_pop(T& out)`
- `template <class CG> __device__ bool try_pop(CG group, T& out)`
- `__device__ void push(T const& value)`: blocking push
- `template <class CG> __device__ void push(CG group, T const& value)`
- `__device__ T pop()`
- `template <class CG> __device__ T pop(CG group)`
Host-side operations (queue)
- `size_type pop_bulk(OutputIt out, size_type max_items, cuda::stream_ref)`
- `size_type pop_bulk_async(OutputIt out, size_type max_items, cuda::stream_ref)`
- `void wait(size_type min_items, cuda::stream_ref)`: wait for `min_items` to be available
- `void wait_async(size_type min_items, cuda::stream_ref)`
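Put together, usage of the proposed API might look like the following sketch. This is illustrative pseudocode: the API is still a proposal, so none of this compiles today, and the kernel and launch details are assumptions.

```cpp
// Hypothetical usage of the proposed API (not compilable today).
__global__ void produce(cuco::queue_ref<int> q) {
  int const item = blockIdx.x * blockDim.x + threadIdx.x;
  q.push(item);  // blocking device-side push; the doorbell batches submissions
}

void consume(cuco::queue<int>& q, cuda::stream_ref stream) {
  std::vector<int> out(1024);
  q.wait(1, stream);                                    // at least one item
  auto n = q.pop_bulk(out.data(), out.size(), stream);  // host-side bulk pop
  out.resize(n);
}
```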
Policy and notifier
queue_policy bundles behavior that affects concurrency semantics and notification strategy:
- `mode` selects MPSC or MPMC behavior.
- `scope` selects the CUDA thread scope for atomics.
- `notifier` selects how producers publish progress to consumers (the doorbell).
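For illustration, a user-defined policy might select a different scope and notifier. This is a hypothetical config fragment mirroring the default `queue_policy` shown earlier; `cuco::queue_notifier::bitmap` is an assumed name, not a confirmed type:

```cpp
// Hypothetical policy for a GPU-to-CPU queue (assumed names).
struct my_policy {
  static constexpr cuco::queue_mode mode          = cuco::queue_mode::mpsc;
  static constexpr cuda::thread_scope scope       = cuda::thread_scope_system;
  using notifier = cuco::queue_notifier::bitmap;  // assumed bitmap doorbell
};

cuco::queue<int, my_policy> q(1 << 20);
```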
Potential constraints and risks
- The prototype uses `length - 1` usable entries to avoid notifier overflow. This affects capacity semantics and should be surfaced in the cuco docs or compensated for by an internal padding slot.
- The prototype uses busy-waiting and nanosleep in alloc and ring; host-side polling should allow configurable backoff to avoid CPU spin.
- The prototype notifier assumes contiguous submission, which may be harder in MPMC pops where multiple consumers race. We need to decide whether MPMC uses a different notifier policy or an alternative pop algorithm.
- The prototype `bitmap` notifier uses a bitmap sized to the queue length. For large queues this may add memory overhead. Consider a chunked bitmap or compile-time policy selection.
- The prototype uses custom `umem` allocators and device types; cuco must map these to its allocator and memory resource model.
Open questions
- We are looking for more use cases and scenarios.
- Should MPMC be in scope for the first version, or should we ship MPSC first and extend later?
- Should the notifier policy be user-facing or hidden behind template defaults?