
[FEA] Add MPSC/MPMC concurrent queue #791

@sleeepyjack

Description


Disclaimer: Work in progress. Community input is appreciated to refine the feature details and requirements.

Overview

We want a cuco::queue data structure for producer-consumer data exchange, inspired by an
internal prototype that supports GPU producers and flexible consumers.
The prototype implements a multiple-producer single-consumer (MPSC) queue with a
notifier abstraction ("doorbell") that batches submissions efficiently on the producer side and
can be adapted to different notification mechanisms.

The core ideas are:

  • A circular queue layout as a linear array in memory
  • A queue control block with atomic head and tail
  • Device-side push operations that reserve slots and write entries
  • A notifier (doorbell) that aggregates submissions so consumers can observe progress efficiently

The doorbell concept is a small state machine that tracks the next visible position and allows
producers to publish contiguous ranges while consumers poll or wait on progress updates.
It can be specialized to optimize for the different message-passing scenarios listed below.
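
For illustration, here is a minimal sketch of one way such a doorbell could be built on a
single device-scope atomic counter; the names and layout are ours and not taken from the
prototype:

#include <cuda/atomic>

// Minimal doorbell sketch: one atomic counter marks the frontier of
// published entries. Producers publish contiguous ranges in submission
// order; consumers poll the frontier. Names are illustrative only.
struct doorbell {
  cuda::atomic<unsigned long long, cuda::thread_scope_device> visible{0};

  // Producer: make the range [first, first + count) visible. Spins until
  // all earlier entries have been published so consumers always observe a
  // contiguous prefix of the queue.
  __device__ void publish(unsigned long long first, unsigned long long count)
  {
    while (visible.load(cuda::std::memory_order_acquire) != first) {}
    visible.store(first + count, cuda::std::memory_order_release);
  }

  // Consumer: index one past the last visible entry.
  __device__ unsigned long long poll() const
  {
    return visible.load(cuda::std::memory_order_acquire);
  }
};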

Potential message-passing scenarios:

  • GPU to CPU (already implemented in the prototype)
  • GPU to GPU (NVLink or network)
  • CTA to CTA (via global memory)
  • Warp to warp (via shared memory for warp-specialized workloads)

Goals

  • Provide a cuco queue API with owning and non-owning reference types.
  • Provide MPSC as the first supported mode with a clear path to MPMC.
  • Provide doorbell/notifier policies that batch submissions efficiently for different message-passing scenarios.
  • Support device-side push and pop, plus host-side bulk or polling operations.
  • Keep storage layout and control layout explicit and allocator-driven.

Proposed API surface

namespace cuco {

enum class queue_mode { mpsc, mpmc };

struct queue_policy {
  static constexpr queue_mode mode = queue_mode::mpsc;
  static constexpr cuda::thread_scope scope = cuda::thread_scope_device;
  using notifier = cuco::queue_notifier::thread;
};

template <class T,
          class Policy = cuco::queue_policy,
          class Allocator = cuco::cuda_allocator<cuda::std::byte>>
class queue;

template <class T,
          class Policy = cuco::queue_policy>
class queue_ref;

}  // namespace cuco

Queue and storage construction and refs

  • queue_storage<T, Policy>: owning storage class
  • queue_storage_ref<T, Policy>: non-owning storage view
  • queue(size_type capacity, cuda_thread_scope<Policy::scope> = {}, Policy const& = {}, Allocator = {}, cuda::stream_ref = {}): owning queue constructor
  • queue_ref(queue_storage_ref<T, Policy> const& storage): non-owning queue reference constructed from a storage view
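
A hypothetical construction flow under this API (the storage_ref() accessor is our
assumption; the proposal above does not name it):

// Hypothetical usage; assumes the owning queue exposes a storage_ref()
// accessor, which the proposal above does not spell out.
cuco::queue<int> q{1'000'000};              // capacity; default policy/allocator/stream
cuco::queue_ref<int> ref{q.storage_ref()};  // non-owning view, passable by value to kernels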

Device-side operations (queue_ref)

  • __device__ bool try_push(T const& value) non-blocking push; returns false if the queue is full
  • template <class CG> __device__ bool try_push(CG group, T const& value) group-cooperative push (cooperative slot reservation)
  • __device__ bool try_pop(T& out) non-blocking pop; returns false if the queue is empty
  • template <class CG> __device__ bool try_pop(CG group, T& out) group-cooperative pop
  • __device__ void push(T const& value) blocking push
  • template <class CG> __device__ void push(CG group, T const& value) group-cooperative blocking push
  • __device__ T pop() blocking pop
  • template <class CG> __device__ T pop(CG group) group-cooperative blocking pop
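
As a usage sketch (not a tested implementation), a producer kernel over this device-side
API could look like:

// Sketch of a producer kernel using the proposed device-side API.
__global__ void produce(cuco::queue_ref<int> queue, int const* items, int n)
{
  int const idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n) { return; }
  // Blocking push: reserves a slot and waits while the queue is full.
  // try_push(items[idx]) would return false instead of waiting, and the
  // cooperative overloads amortize slot reservation across a group.
  queue.push(items[idx]);
}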

Host-side operations (queue)

  • template <class OutputIt> size_type pop_bulk(OutputIt out, size_type max_items, cuda::stream_ref) pop up to max_items into out; returns the number of items popped
  • template <class OutputIt> size_type pop_bulk_async(OutputIt out, size_type max_items, cuda::stream_ref) asynchronous variant
  • void wait(size_type min_items, cuda::stream_ref) wait for min_items to be available
  • void wait_async(size_type min_items, cuda::stream_ref) asynchronous variant
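
A hypothetical host-side consumer loop over this API, assuming an owning queue q and a
cuda::stream_ref stream:

#include <vector>

// Hypothetical host-side consumer built on the proposed bulk API.
std::vector<int> batch(1024);
q.wait(1, stream);  // block until at least one item is available
auto const n = q.pop_bulk(batch.data(), batch.size(), stream);
// process batch[0], ..., batch[n - 1]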

Policy and notifier

queue_policy bundles behavior that affects concurrency semantics and notification strategy:

  • mode selects MPSC or MPMC behavior.
  • scope selects the CUDA thread scope for atomics.
  • notifier selects how producers publish progress to consumers (the doorbell).
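
As a sketch, a user-defined policy for, e.g., CTA-to-CTA exchange could look as follows;
the notifier tag name is reused from the API surface above, everything else is an
assumption:

// Sketch of a user-defined policy: MPMC at block scope for CTA-to-CTA
// exchange. The notifier tag reuses the name from the proposal above.
struct block_mpmc_policy {
  static constexpr cuco::queue_mode mode    = cuco::queue_mode::mpmc;
  static constexpr cuda::thread_scope scope = cuda::thread_scope_block;
  using notifier = cuco::queue_notifier::thread;
};

using block_queue = cuco::queue<int, block_mpmc_policy>;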

Potential constraints and risks

  • The prototype uses length - 1 usable entries to avoid notifier overflow. This affects capacity
    semantics and should be surfaced in the cuco docs or compensated by an internal padding slot
    (see the sketch after this list).
  • The prototype uses busy waiting and nanosleep in alloc and ring; host-side polling should allow
    configurable backoff to avoid CPU spinning.
  • The prototype notifier assumes contiguous submission, which may be harder in MPMC pops where
    multiple consumers race. We need to decide whether MPMC uses a different notifier policy
    or an alternative pop algorithm.
  • The prototype bitmap notifier uses a bitmap sized to queue length. For large queues this
    may add memory overhead. Consider a chunked bitmap or compile time policy selection.
  • The prototype uses custom umem allocators and device types. cuco must map these onto its
    allocator and memory-resource model.
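
Regarding the first point, the capacity - 1 pattern matches the classic ring-buffer trick
of leaving one slot unused so that "full" and "empty" remain distinguishable; an
illustrative (non-prototype) version of the index arithmetic:

#include <cstddef>

// Classic ring-buffer reasoning behind "length - 1 usable entries":
// with head/tail taken modulo len, head == tail must mean empty, so one
// slot stays unused to make "full" distinguishable. Illustrative only;
// the prototype's notifier-overflow rationale is related but its own.
bool empty(std::size_t head, std::size_t tail) { return head == tail; }

bool full(std::size_t head, std::size_t tail, std::size_t len)
{
  return (tail + 1) % len == head;  // the slot before head stays free
}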

Open questions

  • We are looking for more use cases and scenarios.
  • Should MPMC be in scope for the first version, or should we ship MPSC first and extend later?
  • Should the notifier policy be user-facing or hidden behind template defaults?
