
[FEA] Add MPSC/MPMC concurrent queue #791

@sleeepyjack

Description


Disclaimer: Work in progress. Community input is appreciated to refine the feature details and requirements.

Overview

We want a cuco::queue data structure for producer-consumer data exchange, inspired by an
internal prototype that supports GPU producers and flexible consumers.
The prototype implements a multiple-producer single-consumer (MPSC) queue with a
notifier abstraction ("doorbell") that batches submissions efficiently on the producer side and
can be adapted to different notification mechanisms.

The core ideas are:

  • A circular queue layout as a linear array in memory
  • A queue control block with atomic head and tail
  • Device-side push operations that reserve slots and write entries
  • A notifier (doorbell) that aggregates submissions so consumers can observe progress efficiently

The doorbell concept is a small state machine that tracks the next visible position and allows
producers to publish contiguous ranges while consumers poll or wait on progress updates.
It can be specialized to optimize for the different message-passing scenarios listed below.
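
For illustration, here is a minimal sketch of one way such a doorbell could be built on a
single device-scope atomic counter; the names and layout are ours and not taken from the
prototype:

#include <cuda/atomic>

// Minimal doorbell sketch: one atomic counter marks the frontier of
// published entries. Producers publish contiguous ranges in submission
// order; consumers poll the frontier. Names are illustrative only.
struct doorbell {
  cuda::atomic<unsigned long long, cuda::thread_scope_device> visible{0};

  // Producer: make the range [first, first + count) visible. Spins until
  // all earlier entries have been published so consumers always observe a
  // contiguous prefix of the queue.
  __device__ void publish(unsigned long long first, unsigned long long count)
  {
    while (visible.load(cuda::std::memory_order_acquire) != first) {}
    visible.store(first + count, cuda::std::memory_order_release);
  }

  // Consumer: index one past the last visible entry.
  __device__ unsigned long long poll() const
  {
    return visible.load(cuda::std::memory_order_acquire);
  }
};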

Potential message-passing scenarios:

  • GPU to CPU (already implemented in the prototype)
  • GPU to GPU (NVLink or network)
  • CTA to CTA (via global memory)
  • Warp to warp (via shared memory for warp-specialized workloads)

Goals

  • Provide a cuco queue API with owning and non-owning reference types.
  • Provide MPSC as the first supported mode with a clear path to MPMC.
  • Provide doorbell/notifier policies that batch submissions efficiently for different message-passing scenarios.
  • Support device-side push and pop, plus host-side bulk or polling operations.
  • Keep storage layout and control layout explicit and allocator-driven.

Proposed API surface

namespace cuco {

enum class queue_mode { mpsc, mpmc };

struct queue_policy {
  static constexpr queue_mode mode = queue_mode::mpsc;
  static constexpr cuda::thread_scope scope = cuda::thread_scope_device;
  using notifier = cuco::queue_notifier::thread;
};

template <class T,
          class Policy = cuco::queue_policy,
          class Allocator = cuco::cuda_allocator<cuda::std::byte>>
class queue;

template <class T,
          class Policy = cuco::queue_policy>
class queue_ref;

}  // namespace cuco

Queue and storage construction and refs

  • queue_storage<T, Policy>: owning storage class
  • queue_storage_ref<T, Policy>: non-owning storage view
  • queue(size_type capacity, cuda_thread_scope<Policy::scope> = {}, Policy const& = {}, Allocator = {}, cuda::stream_ref = {}): owning queue constructor
  • queue_ref(queue_storage_ref<T, Policy> const& storage): non-owning queue reference constructed from a storage view
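
A hypothetical construction flow under this API (the storage_ref() accessor is our
assumption; the proposal above does not name it):

// Hypothetical usage; assumes the owning queue exposes a storage_ref()
// accessor, which the proposal above does not spell out.
cuco::queue<int> q{1'000'000};              // capacity; default policy/allocator/stream
cuco::queue_ref<int> ref{q.storage_ref()};  // non-owning view, passable by value to kernels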

Device-side operations (queue_ref)

  • __device__ bool try_push(T const& value) non-blocking push; returns false if the queue is full
  • template <class CG> __device__ bool try_push(CG group, T const& value) group-cooperative push (cooperative slot reservation)
  • __device__ bool try_pop(T& out) non-blocking pop; returns false if the queue is empty
  • template <class CG> __device__ bool try_pop(CG group, T& out) group-cooperative pop
  • __device__ void push(T const& value) blocking push
  • template <class CG> __device__ void push(CG group, T const& value) group-cooperative blocking push
  • __device__ T pop() blocking pop
  • template <class CG> __device__ T pop(CG group) group-cooperative blocking pop
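
As a usage sketch (not a tested implementation), a producer kernel over this device-side
API could look like:

// Sketch of a producer kernel using the proposed device-side API.
__global__ void produce(cuco::queue_ref<int> queue, int const* items, int n)
{
  int const idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n) { return; }
  // Blocking push: reserves a slot and waits while the queue is full.
  // try_push(items[idx]) would return false instead of waiting, and the
  // cooperative overloads amortize slot reservation across a group.
  queue.push(items[idx]);
}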

Host-side operations (queue)

  • template <class OutputIt> size_type pop_bulk(OutputIt out, size_type max_items, cuda::stream_ref) pop up to max_items into out; returns the number of items popped
  • template <class OutputIt> size_type pop_bulk_async(OutputIt out, size_type max_items, cuda::stream_ref) asynchronous variant
  • void wait(size_type min_items, cuda::stream_ref) wait for min_items to be available
  • void wait_async(size_type min_items, cuda::stream_ref) asynchronous variant
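
A hypothetical host-side consumer loop over this API, assuming an owning queue q and a
cuda::stream_ref stream:

#include <vector>

// Hypothetical host-side consumer built on the proposed bulk API.
std::vector<int> batch(1024);
q.wait(1, stream);  // block until at least one item is available
auto const n = q.pop_bulk(batch.data(), batch.size(), stream);
// process batch[0], ..., batch[n - 1]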

Policy and notifier

queue_policy bundles behavior that affects concurrency semantics and notification strategy:

  • mode selects MPSC or MPMC behavior.
  • scope selects the CUDA thread scope for atomics.
  • notifier selects how producers publish progress to consumers (the doorbell).
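
As a sketch, a user-defined policy for, e.g., CTA-to-CTA exchange could look as follows;
the notifier tag name is reused from the API surface above, everything else is an
assumption:

// Sketch of a user-defined policy: MPMC at block scope for CTA-to-CTA
// exchange. The notifier tag reuses the name from the proposal above.
struct block_mpmc_policy {
  static constexpr cuco::queue_mode mode    = cuco::queue_mode::mpmc;
  static constexpr cuda::thread_scope scope = cuda::thread_scope_block;
  using notifier = cuco::queue_notifier::thread;
};

using block_queue = cuco::queue<int, block_mpmc_policy>;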

Potential constraints and risks

  • The prototype uses length - 1 usable entries to avoid notifier overflow. This affects capacity
    semantics and should be surfaced in the cuco docs or compensated by an internal padding slot
    (see the sketch after this list).
  • The prototype uses busy waiting and nanosleep in alloc and ring; host-side polling should allow
    configurable backoff to avoid CPU spinning.
  • The prototype notifier assumes contiguous submission, which may be harder in MPMC pops where
    multiple consumers race. We need to decide whether MPMC uses a different notifier policy
    or an alternative pop algorithm.
  • The prototype bitmap notifier uses a bitmap sized to queue length. For large queues this
    may add memory overhead. Consider a chunked bitmap or compile time policy selection.
  • The prototype uses custom umem allocators and device types. cuco must map these onto its
    allocator and memory-resource model.
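
Regarding the first point, the capacity - 1 pattern matches the classic ring-buffer trick
of leaving one slot unused so that "full" and "empty" remain distinguishable; an
illustrative (non-prototype) version of the index arithmetic:

#include <cstddef>

// Classic ring-buffer reasoning behind "length - 1 usable entries":
// with head/tail taken modulo len, head == tail must mean empty, so one
// slot stays unused to make "full" distinguishable. Illustrative only;
// the prototype's notifier-overflow rationale is related but its own.
bool empty(std::size_t head, std::size_t tail) { return head == tail; }

bool full(std::size_t head, std::size_t tail, std::size_t len)
{
  return (tail + 1) % len == head;  // the slot before head stays free
}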

Open questions

  • We are looking for more use cases and scenarios.
  • Should MPMC be in scope for the first version, or should we ship MPSC first and extend later?
  • Should the notifier policy be user-facing or hidden behind template defaults?
