
Track allocation origin to fix cross-thread/stream attribution#114

Open
ducndh wants to merge 2 commits into NVIDIA:main from ducndh:origin-tracked-allocation-attribution

Conversation


@ducndh ducndh commented Apr 27, 2026

Bug

reservation_aware_resource_adaptor debits every dealloc against the calling thread's (PER_THREAD) or stream's (PER_STREAM) currently-attached reservation. When a buffer outlives that context (e.g., a producer thread allocates and a consumer thread destroys), the wrong reservation gets debited, its allocated_bytes goes negative, and the next do_release_reservation over-drains the global counter (arena_size minus a negative value flips into an addition). After enough releases the unsigned _total_allocated_bytes wraps and every subsequent reservation request fails.

Fix

Record the originating reservation per allocation in a sharded ptr -> weak_ptr<arena> map (16 shards). At dealloc, debit the origin reservation; if it's already released, fall through to the symmetric global-counter debit. device_reserved_arena gains an idempotent release() so the global counter still updates at logical release time even when outstanding allocations keep the arena physically alive via weak_ptr.lock(). assign_reservation_to_tracker switches to shared_ptr<arena> to support this; the public attach_reservation_to_tracker API is unchanged.

Tests

  • All 266 existing tests pass (7811 assertions).
  • Rewrote multi-reservation memory_resource mismatch (which asserted the legacy buggy behaviour) as multi-reservation cross-stream dealloc preserves origin attribution.
  • Added PER_THREAD mode: cross-thread dealloc preserves origin attribution covering the cross-thread path that motivated the fix. Fails under pre-fix code with the expected +alloc_size skew; passes under the fix.

Some notes

  • 16 shards, sized for ~16-32 concurrent ops; the shard count could be made tunable.
  • get_tracker_state may have a pre-existing race with reset_tracker_state (raw-pointer return).

The reservation_aware_resource_adaptor previously attributed every
deallocate to the calling thread's (PER_THREAD mode) or stream's
(PER_STREAM mode) currently-attached reservation. This is incorrect when
a buffer's lifetime crosses thread or stream boundaries: a buffer
allocated under reservation R1 but freed while R2 is the active
reservation gets debited against R2, driving R2's allocated_bytes
counter negative and corrupting subsequent release accounting (the
arena-size minus allocated-bytes computation flips sign and over-drains
the global counter, eventually wrapping the unsigned total).

Fix: record the origin reservation at allocate-time in a global
ptr -> weak_ptr<arena> map; at deallocate-time, debit the origin
reservation regardless of the calling thread/stream's current state. If
the origin reservation has already been logically released (or fully
destructed), fall back to the symmetric global-counter debit path
(upstream_reclaimed_bytes = tracking_bytes), which balances the alloc.

Implementation:
- device_reserved_arena gains an idempotent release() and a logical
  release flag. Destructor calls release().
- stream_ordered_tracker_state::memory_reservation switches from
  unique_ptr to shared_ptr so the origin map can hold weak_ptr.
- assign_reservation_to_tracker takes shared_ptr; impl_type::
  attach_reservation_to_tracker converts at the API boundary.
- reset_tracker_state captures the arena, erases the tracker entry,
  then calls arena->release() outside the tracker mutex. Outstanding
  weak_ptr holders keep the arena physically alive but observe its
  released state.
- allocate records (ptr -> weak_ptr<arena>) in the alloc-origin map.
- deallocate locks the weak_ptr, debits the origin arena if alive and
  not logically released, otherwise just decrements the global counter.

The existing "multi-reservation memory_resource mismatch" test asserted
the legacy buggy behaviour ("allocations from another memory resource is
absorbed by other stream as extra reservation"). It is rewritten to
verify origin attribution: each buffer is debited from the reservation
that allocated it, not the dealloc stream's reservation.

Verified by a full-pipeline reproducer in sirius (cucascade-using GPU
SQL engine): the legacy bug fired ~50 cross-stream-dealloc events plus
~50 inflated-release events per 30-second run, and the symptom was
spurious GPU-OOM after sustained workload due to the global-counter
wrap. With this fix both event classes drop to zero across single-load
and doubled-load reproducers; sirius's 43-test suite covering
downgrade_executor / convertible_gpu_pipeline_task / gpu_pipeline_task /
scan paths still passes.

Note: the alloc-origin map uses a single mutex for v1; sharding to a
concurrent or per-shard map is a sensible follow-up.
Copilot AI review requested due to automatic review settings April 27, 2026 22:49
@copy-pr-bot

copy-pr-bot Bot commented Apr 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


Copilot AI left a comment


Pull request overview

Fixes incorrect reservation accounting when buffers are deallocated from a different stream/thread than the one that performed the allocation, by tracking the originating reservation per allocation and debiting the correct arena (or the global counter if the reservation has been logically released).

Changes:

  • Track allocation origin via a sharded ptr -> weak_ptr<device_reserved_arena> map and use it during deallocation to debit the correct reservation.
  • Add an idempotent device_reserved_arena::release() and switch tracker ownership to shared_ptr so origin tracking remains valid after reset_tracker_state.
  • Update/extend tests to cover cross-stream and cross-thread deallocation attribution (including PER_THREAD mode).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/memory/reservation_aware_resource_adaptor.cpp Adds origin tracking and updates allocation/deallocation accounting behavior.
include/cucascade/memory/detail/reservation_aware_resource_adaptor_impl.hpp Makes reservation release idempotent and updates tracker arena ownership to shared_ptr.
test/memory/test_memory_reservation_manager.cpp Rewrites the legacy mismatch test and adds PER_THREAD cross-thread dealloc coverage.


Comment on lines +532 to +541
std::fprintf(stderr,
"[ORIGIN-DEBIT-NEGATIVE #%d] origin_arena=%p tracking=%zu "
"arena.allocated_bytes %ld -> %ld (reservation_size=%ld). "
"Likely double-free or bypassed alloc path.\n",
n,
static_cast<void*>(origin_arena),
tracking_bytes,
pre_deallocation_size,
post_deallocation_size,
reservation_size);

Copilot AI Apr 27, 2026


The std::fprintf format string uses %ld for int64_t values (pre_deallocation_size, post_deallocation_size, reservation_size). int64_t is not guaranteed to be long on all platforms, which can lead to undefined behavior / corrupted output. Prefer using <cinttypes> macros (e.g., PRId64) or cast to long long and print with %lld consistently.

Comment on lines +599 to +608
std::fprintf(stderr,
"[INFLATED-RELEASE #%d] arena_size=%ld alloc_bytes=%ld -> "
"released_bytes=%zu (over-drain by %ld bytes); "
"_total_allocated_bytes before sub = %lld\n",
n,
arena_size,
allocation_size,
released_bytes,
-allocation_size,
cur);

Copilot AI Apr 27, 2026


The debug std::fprintf here prints int64_t values (arena_size, allocation_size) using %ld. int64_t is not guaranteed to match long, so this is technically undefined behavior on some platforms. Consider switching to PRId64 (from <cinttypes>) or using long long/%lld to make the diagnostics portable and correct.

Comment on lines +56 to +62
// Idempotent: only the first call subtracts the arena's unused portion from the global counter.
void release() noexcept
{
bool expected = false;
if (_logically_released.compare_exchange_strong(expected, true)) {
_impl->do_release_reservation(this);
}

Copilot AI Apr 27, 2026


device_reserved_arena::release() triggers do_release_reservation() which snapshots allocated_bytes via load(). If release() can run concurrently with an in-flight deallocate() that is still debiting allocated_bytes under the (not-yet-released) path, the release calculation can observe a stale allocation_size and leave _total_allocated_bytes inflated. Consider documenting that reset_stream_reservation()/release() must not race with allocate/deallocate for that arena, or add synchronization (e.g., an arena-level mutex or a release protocol that prevents concurrent origin-debits while computing released_bytes).

@felipeblazing
Contributor

/ok to test 44f505a

Contributor

@felipeblazing felipeblazing left a comment


I feel like the right approach is not to try and extend lifecycles but rather to be able to move something from one reservation to another. So when that data is passed to another context we want its allocation to be tracked by the new reservation. We also want reservations to be able to account for previously existing reservations. I think this is going to need a design document to make sure we are aligned on the approach.

}

~device_reserved_arena() noexcept { _impl->do_release_reservation(this); }
~device_reserved_arena() noexcept { release(); }
Contributor


why would the destructor be called more than once?

* @brief Reservation state
*/
struct stream_ordered_tracker_state {
std::unique_ptr<device_reserved_arena>
Contributor


shouldn't we just be managing this by keeping the stream_ordered_tracker_state alive which holds the arena?


namespace {

// Origin reservation per allocation; weak_ptr survives reset_tracker_state.
Contributor


doesn't this just mean that we are calling reset_tracker_state before the work scheduled on a stream and thread is completed? It feels like we are missing some kind of synchronization before the reset is called. I don't think the solution is to fix it by trying to hold onto it longer.

Experimental — drops the explicit release() method + _logically_released
atomic flag, returning device_reserved_arena to a dtor-only release path.
Tracker reset_tracker_state callbacks correspondingly drop their
release() invocations and rely on the shared_ptr lifecycle instead.

Parked here so PR NVIDIA#6 (sirius duckdb-native walker) can reset the
sirius-recorded cucascade pin (0cd4a6a) and run its tests. To resume
this branch:
  cd cucascade && git checkout origin-tracked-allocation-attribution

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ducndh
Author

ducndh commented May 5, 2026

cuCascade origin-tracking redesign — proposal

The bug

cuCascade's reservation_aware_resource_adaptor in PER_THREAD mode (sirius's default) reads the deallocation thread's thread_local to decide which arena to debit. When a buffer is allocated on thread A (with reservation R_scan attached) and deallocated on thread B (with R_pipe attached), R_pipe is debited for memory it never allocated. R_pipe's allocated_bytes goes negative, and from that moment on, R_pipe reports more available headroom than its declared reservation size — sirius can allocate beyond R_pipe's budget, silently consuming physical memory other stages were planning on. At scale this manifests as bad_alloc from RMM despite cuCascade's accounting showing room.

Setup for the timelines

  R_scan : scan-stage reservation, size = 100 KB
  R_pipe : pipeline-stage reservation, size = 200 KB
  p      : an 80 KB buffer
  T_scan : scan worker thread (R_scan in TLS)
  T_pipe : pipeline worker thread (R_pipe in TLS)
  capacity (cuCascade) = 400 KB

Events:
  1. reserve R_scan
  2. reserve R_pipe
  3. T_scan: alloc(p, 80 KB)        ← p physically allocated by RMM
  4. handoff p → T_pipe             ← buffer pointer crosses thread pools
  5. T_pipe: dealloc(p, 80 KB)      ← p physically freed by RMM
  6. T_scan: release R_scan
  7. (time passes — R_pipe still alive)
  8. T_pipe: release R_pipe

Timeline 1 — what we have today

                              R_scan.alloc   R_pipe.alloc   _total   reality
─────────────────────────────────────────────────────────────────────────────
1. reserve R_scan                  0              -            100 K     100 K
2. reserve R_pipe                  0              0            300 K     300 K
3. T_scan alloc(p, 80 K)          80 K            0            300 K     380 K
4. handoff (cucascade unaware)    80 K            0            300 K     380 K
5. T_pipe dealloc(p, 80 K)
   reads T_pipe.tls = R_pipe      80 K          −80 K          300 K     300 K
                                  ↑              ↑
                          uncredited       OVER-DEBITED
                                                 ↑
                                  R_pipe.available = 200 − (−80) = 280 K
                                  → sirius can allocate past R_pipe's 200 K cap
                                  → over-spends physical memory; eventual bad_alloc

6. release R_scan (refund 20 K)    -          −80 K          280 K     200 K
7.                                 -          −80 K          280 K     200 K
8. release R_pipe                  -            -              0         0
   (math closes via R_pipe's negative arena.alloc)

Failure point: event 5. R_pipe.alloc becomes negative; from this point R_pipe reports phantom headroom and accepts allocations beyond its declared size. The release-time arithmetic still closes, but the per-arena view that sirius reads has been corrupted, and any allocation made on R_pipe between events 5 and 8 can exceed the budget.


Timeline 2 — origin tracking with extended arena lifetime (current PR #114 stopgap)

cuCascade keeps a per-buffer origin tag at allocation. At dealloc, the tag is consulted to credit the alloc-origin arena (R_scan), not the dealloc-thread's TLS (R_pipe). To make this work when R_scan has already released but p is still in flight, R_scan's arena is kept alive via shared_ptr past user-visible release.

                              R_scan.alloc   R_pipe.alloc   _total   reality
─────────────────────────────────────────────────────────────────────────────
1. reserve R_scan                  0              -            100 K     100 K
2. reserve R_pipe                  0              0            300 K     300 K
3. T_scan alloc(p, 80 K)
   tag_map[p] = R_scan_id         80 K            0            300 K     380 K
4. handoff                         80 K            0            300 K     380 K
5. T_pipe dealloc(p, 80 K)
   tag_map[p] → R_scan
   R_scan.alloc -= 80               0              0            300 K     300 K  ✓ correct
6. release R_scan
   arena.alloc = 0 → refund 100 K   -              0            200 K     200 K
   ⚠ if dealloc were AFTER release, the arena is kept alive past release
     until refcount drops; release-snapshot can race with concurrent sub()

Two problems with this shape:

  • Lifetime extension (r3165421504): R_scan's C++ arena outlives its user-visible release whenever a buffer it allocated is still in flight downstream. "Release ends lifetime" no longer holds.
  • Race (r3150600212, Copilot): the snapshot of arena.allocated_bytes taken inside release() can read a value that a concurrent dealloc on T_pipe is about to subtract from. The accounting drifts by exactly the in-flight buffer size.

The two designs below both close the failure in Timeline 1 without extending arena lifetime and without the race.


Timeline 3 — orphan tier at release time

Same origin-tag mechanism at dealloc as Timeline 2. At release, in-flight bytes are moved to a separate visible counter (_orphan_in_flight_bytes); the arena dies on schedule. Late deallocs find a dead arena and reconcile against the orphan counter.

                          R_scan.alloc   R_pipe.alloc   _total   _orphan   reality
──────────────────────────────────────────────────────────────────────────────────
1–4. reserve, alloc(p), handoff   80 K          0           300 K     0       380 K
5'. R_scan releases first
    snapshot R_scan.alloc = 80 K
    refund 100 − 80 = 20 K
    move 80 K → orphan
    R_scan.arena DIES                -            0           280 K    80 K     280 K
                                                              ↑       ↑
                                                  total = res + orphan + alloc
                                                  (orphan is publicly observable)
6'. T_pipe dealloc(p, 80 K)
    tag_map[p] → R_scan_id (dead)
    fallback: orphan -= 80, _total -= 80
                                     -            0           200 K     0       200 K
7'. release R_pipe (refund 200 K)    -            -             0       0         0

Mental model: a reservation is operator workspace I will allocate against — incoming buffers from upstream stages don't consume it. R_pipe's per-arena counter is never touched by the misattribution; sirius reads R_pipe.size minus what R_pipe itself has allocated, and the value is always correct.

This matches sirius's existing reservation-sizing logic — gpu_pipeline_task.cpp:425 sizes reservations as a function of expected operator compute, not "absorb incoming buffers."

[†] The arena C++ object dies at user-visible release, satisfying a strict reading of arena does not live past its usage. The orphan counter does retain bytes attributed to the released reservation until late deallocs reconcile. If the objection is to the arena outliving release, this satisfies it; if the objection is to any bookkeeping outliving release, it does not — Timeline 4 below is the only design that satisfies that strict reading.


Timeline 4 — explicit transfer at handoff time

Sirius declares an ownership transfer at every cross-thread handoff site via a new transfer(ptr, R_new, size) API. cuCascade debits R_scan and credits R_pipe at the call; R_scan is fully drained before its release, so the release is unconditionally clean.

                              R_scan.alloc   R_pipe.alloc   _total   reality
─────────────────────────────────────────────────────────────────────────────
1–3. reserve, alloc(p)            80 K            0            300 K    380 K
4. handoff: sirius calls
   mr->transfer(p, R_pipe, 80 K)
   R_scan.alloc -= 80
   R_pipe.alloc += 80
   tag_map[p] = R_pipe_id           0            80 K          300 K    380 K
                                    ↑             ↑
                          cleanly drained    takes ownership

5. T_pipe dealloc(p, 80 K)
   tag_map[p] → R_pipe                0            0           300 K    300 K
6. release R_scan (refund 100 K)     -            0           200 K    200 K
7–8. R_pipe runs and releases        -            -             0        0

Mental model: a reservation is all the memory this stage is responsible for at any moment — including buffers transferred in from upstream. R_pipe's per-arena counter reflects everything R_pipe owes the system, in a single number.

Cost: ~15 audited handoff sites in sirius today need a transfer() call; every future operator that introduces a cross-pool buffer movement must remember to add one. CI test enforcement is recommended to make sure we don't have the same problem as today.


Empirical demonstration of point 1 (today's bug)

We added a deterministic Catch2 SCENARIO that reproduces the over-allocation directly. Setup mirrors the timeline above (numbers scaled down for test simplicity):

SCENARIO("PER_THREAD: −20 KB mis-credit, then 84 KB alloc on a 96 KB-capacity system",
         "[memory_space][bug-repro][user-scenario]")
{
  // capacity 96 KB, R_pipe = 64 KB, R_scan = 32 KB, buffer = 20 KB
  auto res_pipe = manager->request_reservation({Tier::GPU, 0}, 64 * 1024);
  auto res_scan = manager->request_reservation({Tier::GPU, 0}, 32 * 1024);
  auto* mr      = res_pipe->get_memory_resource_as<reservation_aware_resource_adaptor>();

  // T_scan attaches R_scan, allocates 20 KB, hands buffer to T_pipe.
  // T_pipe attaches R_pipe, deallocates the 20 KB buffer.
  //   → R_pipe.alloc = −20 KB

  // Now T_pipe asks for 84 KB on R_pipe (R_pipe.size is only 64 KB):
  void* over_buffer = mr->allocate(stream, 84 * 1024, alignof(std::max_align_t));

  REQUIRE(over_buffer != nullptr);                     // ✓ approved
  REQUIRE(cudaMemsetAsync(over_buffer, 0xDD,
                          84 * 1024, stream) == cudaSuccess);  // ✓ real GPU memory
}

Observed: the 84 KB allocation succeeds, returns a valid GPU pointer, and the memory is writable. R_pipe's declared 64 KB cap is silently exceeded by 20 KB. That is 84 KB against the pipe reservation plus the 20 KB scan handoff, 104 KB in total, which fails silently on a 96 KB capacity.


In short: what both designs share, and where they differ

Both Timeline 3 and Timeline 4 close Timeline 1's failure (R_pipe.alloc never goes negative, so no phantom headroom). Both eliminate Timeline 2's lifetime extension and race. They differ on:

  • What R_pipe.alloc means. Timeline 3: workspace I've allocated. Timeline 4: total memory I'm responsible for, including data transferred-in.
  • Who participates in resolving the misattribution. Timeline 3: cuCascade-internal at release time. Timeline 4: sirius declares transfer at handoff time.
  • Maintenance cost. Timeline 3: one cuCascade PR. Timeline 4: cuCascade PR + sirius PR + ongoing audit discipline as new operators are added.

The choice depends on which mental model cuCascade wants its reservation API to commit to. Sirius today maps to Timeline 3. Timeline 4 is more accurate per-stage accounting, at the cost of consumer-side participation forever.


More testing and code check

To actually stress the system, I over-allocate cuCascade's capacity to force a bad_alloc, since the original code lets you allocate memory beyond cuCascade's capacity.

  • [user-scenario] (excerpted above) — −20 KB mis-credit + 84 KB alloc on a 64 KB budget succeeds.
  • [rmm-bad-alloc] — wires a bounded RMM pool (96 KB) under cuCascade with a deliberately loose _capacity = 256 KB. The over-allocation passes cuCascade's accounting check and reaches RMM, which throws std::bad_alloc from pool_memory_resource_impl.cpp:70. This reproduces the production failure pattern from the ff14ff4 commit message end-to-end: cuCascade's accounting bug → RMM-level bad_alloc.

A separate validation: running the test against the existing PR #114 fix branch (origin-tracked-allocation-attribution) shows zero accounting problem across all rounds, confirming that origin-tracking-at-dealloc is the foundation both Timeline 3 and Timeline 4 build on.

