Track allocation origin to fix cross-thread/stream attribution #114
ducndh wants to merge 2 commits into NVIDIA:main from
Conversation
The reservation_aware_resource_adaptor previously attributed every
deallocate to the calling thread's (PER_THREAD mode) or stream's
(PER_STREAM mode) currently-attached reservation. This is incorrect when
a buffer's lifetime crosses thread or stream boundaries: a buffer
allocated under reservation R1 but freed while R2 is the active
reservation gets debited against R2, driving R2's allocated_bytes
counter negative and corrupting subsequent release accounting (the
arena-size minus allocated-bytes computation flips sign and over-drains
the global counter, eventually wrapping the unsigned total).
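As a concrete illustration of the sign flip, here is a small self-contained sketch with hypothetical numbers (a 64 KiB arena whose `allocated_bytes` was driven to -20 KiB by a misattributed dealloc); the function names and values are illustrative, not cucascade code:

```cpp
#include <cstdint>

// Hypothetical numbers: a 64 KiB arena, allocated_bytes corrupted to -20 KiB
// by a dealloc that was debited against the wrong reservation.
std::uint64_t released_bytes_after_misattribution() {
  std::int64_t arena_size      = 64 * 1024;
  std::int64_t allocated_bytes = -20 * 1024;  // corrupted by the wrong-arena debit
  // release computes arena_size - allocated_bytes: 64K - (-20K) = 84K,
  // i.e. 20 KiB more than the arena ever contributed.
  return static_cast<std::uint64_t>(arena_size - allocated_bytes);
}

// Draining 84 KiB from an unsigned global counter that only holds 64 KiB wraps it.
std::uint64_t global_counter_after_drain() {
  std::uint64_t total = 64 * 1024;                 // what the arena actually contributed
  total -= released_bytes_after_misattribution();  // subtract 84 KiB: wraps below zero
  return total;
}
```

Once the counter wraps near `UINT64_MAX`, every subsequent reservation request sees the total as effectively exhausted capacity.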
Fix: record the origin reservation at allocate-time in a global
ptr -> weak_ptr<arena> map; at deallocate-time, debit the origin
reservation regardless of the calling thread/stream's current state. If
the origin reservation has already been logically released (or fully
destructed), fall back to the symmetric global-counter debit path
(upstream_reclaimed_bytes = tracking_bytes), which balances the alloc.
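A minimal sketch of this allocate/deallocate protocol in isolation; `arena`, `record_origin`, and `debit_on_deallocate` are stand-in names rather than cucascade's API, and a single global mutex is assumed as in the v1 note below:

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <mutex>
#include <unordered_map>

// Stand-in for device_reserved_arena: a per-reservation byte counter plus a
// logical-release flag.
struct arena {
  std::atomic<long long> allocated_bytes{0};
  std::atomic<bool> logically_released{false};
};

std::mutex origin_mutex;
std::unordered_map<void*, std::weak_ptr<arena>> origin_map;  // ptr -> origin reservation

void record_origin(void* p, const std::shared_ptr<arena>& a) {
  std::lock_guard<std::mutex> lk(origin_mutex);
  origin_map[p] = a;  // weak_ptr: origin tracking never extends the logical lifetime
}

// Returns the bytes to debit from the global counter when no live origin absorbs them.
std::size_t debit_on_deallocate(void* p, std::size_t bytes) {
  std::weak_ptr<arena> origin;
  {
    std::lock_guard<std::mutex> lk(origin_mutex);
    auto it = origin_map.find(p);
    if (it != origin_map.end()) { origin = it->second; origin_map.erase(it); }
  }
  if (auto a = origin.lock(); a && !a->logically_released.load()) {
    a->allocated_bytes -= static_cast<long long>(bytes);  // debit the *origin*, not the caller
    return 0;  // the global counter settles when the origin releases
  }
  return bytes;  // origin gone: symmetric global-counter debit balances the alloc
}
```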
Implementation:
- device_reserved_arena gains an idempotent release() and a logical
release flag. Destructor calls release().
- stream_ordered_tracker_state::memory_reservation switches from
unique_ptr to shared_ptr so the origin map can hold weak_ptr.
- assign_reservation_to_tracker takes shared_ptr; impl_type::
attach_reservation_to_tracker converts at the API boundary.
- reset_tracker_state captures the arena, erases the tracker entry,
then calls arena->release() outside the tracker mutex. Outstanding
weak_ptr holders keep the arena physically alive but observe its
released state.
- allocate records (ptr -> weak_ptr<arena>) in the alloc-origin map.
- deallocate locks the weak_ptr, debits the origin arena if alive and
not logically released, otherwise just decrements the global counter.
The existing "multi-reservation memory_resource mismatch" test asserted
the legacy buggy behaviour ("allocations from another memory resource is
absorbed by other stream as extra reservation"). It is rewritten to
verify origin attribution: each buffer is debited from the reservation
that allocated it, not the dealloc stream's reservation.
Verified by a full-pipeline reproducer in sirius (cucascade-using GPU
SQL engine): the legacy bug fired ~50 cross-stream-dealloc events plus
~50 inflated-release events per 30-second run, and the symptom was
spurious GPU-OOM after sustained workload due to the global-counter
wrap. With this fix both event classes drop to zero across single-load
and doubled-load reproducers; sirius's 43-test suite covering
downgrade_executor / convertible_gpu_pipeline_task / gpu_pipeline_task /
scan paths still passes.
Note: the alloc-origin map uses a single mutex for v1; sharding to a
concurrent or per-shard map is a sensible follow-up.
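A sketch of what that sharded follow-up could look like; the shard count, pointer-hash mixing, and `dummy_arena` stand-in are assumptions for illustration, not cucascade code:

```cpp
#include <array>
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

struct dummy_arena {};  // stand-in for device_reserved_arena

class sharded_origin_map {
  static constexpr std::size_t kShards = 16;
  struct shard {
    std::mutex m;
    std::unordered_map<void*, std::weak_ptr<dummy_arena>> map;
  };
  std::array<shard, kShards> _shards;

  shard& shard_for(void* p) {
    // Shift away low bits, which are usually zero due to allocation alignment.
    auto h = reinterpret_cast<std::uintptr_t>(p) >> 6;
    return _shards[h % kShards];
  }

 public:
  void insert(void* p, std::weak_ptr<dummy_arena> a) {
    auto& s = shard_for(p);
    std::lock_guard<std::mutex> lk(s.m);
    s.map[p] = std::move(a);
  }

  // Removes and returns the origin entry; an empty weak_ptr if absent.
  std::weak_ptr<dummy_arena> extract(void* p) {
    auto& s = shard_for(p);
    std::lock_guard<std::mutex> lk(s.m);
    auto it = s.map.find(p);
    if (it == s.map.end()) return {};
    auto w = std::move(it->second);
    s.map.erase(it);
    return w;
  }
};
```

Concurrent allocate/deallocate calls on different pointers then contend only when they hash to the same shard.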
Pull request overview
Fixes incorrect reservation accounting when buffers are deallocated from a different stream/thread than the one that performed the allocation, by tracking the originating reservation per allocation and debiting the correct arena (or the global counter if the reservation has been logically released).
Changes:
- Track allocation origin via a sharded `ptr -> weak_ptr<device_reserved_arena>` map and use it during deallocation to debit the correct reservation.
- Add an idempotent `device_reserved_arena::release()` and switch tracker ownership to `shared_ptr` so origin tracking remains valid after `reset_tracker_state`.
- Update/extend tests to cover cross-stream and cross-thread deallocation attribution (including PER_THREAD mode).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/memory/reservation_aware_resource_adaptor.cpp | Adds origin tracking and updates allocation/deallocation accounting behavior. |
| include/cucascade/memory/detail/reservation_aware_resource_adaptor_impl.hpp | Makes reservation release idempotent and updates tracker arena ownership to shared_ptr. |
| test/memory/test_memory_reservation_manager.cpp | Rewrites the legacy mismatch test and adds PER_THREAD cross-thread dealloc coverage. |
```cpp
std::fprintf(stderr,
             "[ORIGIN-DEBIT-NEGATIVE #%d] origin_arena=%p tracking=%zu "
             "arena.allocated_bytes %ld -> %ld (reservation_size=%ld). "
             "Likely double-free or bypassed alloc path.\n",
             n,
             static_cast<void*>(origin_arena),
             tracking_bytes,
             pre_deallocation_size,
             post_deallocation_size,
             reservation_size);
```
The std::fprintf format string uses %ld for int64_t values (pre_deallocation_size, post_deallocation_size, reservation_size). int64_t is not guaranteed to be long on all platforms, which can lead to undefined behavior / corrupted output. Prefer using <cinttypes> macros (e.g., PRId64) or cast to long long and print with %lld consistently.
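A sketch of the suggested portable form using `PRId64` from `<cinttypes>`; the `format_diag` helper and its field names are illustrative, not the PR's actual diagnostic:

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// int64_t is long on LP64 Linux but long long on e.g. 64-bit Windows, so %ld is
// not portable; PRId64 expands to the correct conversion for the platform.
std::string format_diag(std::int64_t pre, std::int64_t post) {
  char buf[128];
  std::snprintf(buf, sizeof(buf),
                "allocated_bytes %" PRId64 " -> %" PRId64, pre, post);
  return buf;
}
```

Casting to `long long` and printing with `%lld` is the equivalent alternative the review mentions.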
```cpp
std::fprintf(stderr,
             "[INFLATED-RELEASE #%d] arena_size=%ld alloc_bytes=%ld -> "
             "released_bytes=%zu (over-drain by %ld bytes); "
             "_total_allocated_bytes before sub = %lld\n",
             n,
             arena_size,
             allocation_size,
             released_bytes,
             -allocation_size,
             cur);
```
The debug std::fprintf here prints int64_t values (arena_size, allocation_size) using %ld. int64_t is not guaranteed to match long, so this is technically undefined behavior on some platforms. Consider switching to PRId64 (from <cinttypes>) or using long long/%lld to make the diagnostics portable and correct.
```cpp
// Idempotent: only the first call subtracts the arena's unused portion from the global counter.
void release() noexcept
{
  bool expected = false;
  if (_logically_released.compare_exchange_strong(expected, true)) {
    _impl->do_release_reservation(this);
  }
}
```
device_reserved_arena::release() triggers do_release_reservation() which snapshots allocated_bytes via load(). If release() can run concurrently with an in-flight deallocate() that is still debiting allocated_bytes under the (not-yet-released) path, the release calculation can observe a stale allocation_size and leave _total_allocated_bytes inflated. Consider documenting that reset_stream_reservation()/release() must not race with allocate/deallocate for that arena, or add synchronization (e.g., an arena-level mutex or a release protocol that prevents concurrent origin-debits while computing released_bytes).
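One possible shape of the synchronization the review suggests: an arena-level mutex that serializes origin-debits against the release snapshot, so `release()` can never observe `allocated_bytes` mid-update. `guarded_arena` and its method names are a hypothetical sketch, not cucascade's design:

```cpp
#include <cstdint>
#include <mutex>

struct guarded_arena {
  std::mutex m;
  std::int64_t allocated_bytes = 0;
  bool released = false;
  std::int64_t arena_size;

  explicit guarded_arena(std::int64_t size) : arena_size(size) {}

  void on_allocate(std::int64_t bytes) {
    std::lock_guard<std::mutex> lk(m);
    allocated_bytes += bytes;
  }

  // Returns true if the debit landed on this arena; false if already released,
  // in which case the caller falls back to the global-counter path.
  bool origin_debit(std::int64_t bytes) {
    std::lock_guard<std::mutex> lk(m);
    if (released) return false;
    allocated_bytes -= bytes;
    return true;
  }

  // Returns the bytes to drain from the global counter; idempotent. The
  // snapshot is consistent because debits hold the same lock.
  std::int64_t release() {
    std::lock_guard<std::mutex> lk(m);
    if (released) return 0;
    released = true;
    return arena_size - allocated_bytes;
  }
};
```

The atomic-flag version in the PR is cheaper but leaves the window the review describes; documenting a no-race-with-release contract is the other option mentioned.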
/ok to test 44f505a
felipeblazing left a comment
I feel like the right approach is not to try and extend lifecycles but rather to be able to move something from one reservation to another. So when that data is passed to another context we want its allocation to be tracked by the new reservation. We also want reservations to be able to account for previously existing reservations. I think this is going to need a design document to make sure we are aligned on the approach.
```diff
-~device_reserved_arena() noexcept { _impl->do_release_reservation(this); }
+~device_reserved_arena() noexcept { release(); }
```
why would the destructor be called more than once?
```cpp
 * @brief Reservation state
 */
struct stream_ordered_tracker_state {
  std::unique_ptr<device_reserved_arena>
```
shouldn't we just be managing this by keeping the stream_ordered_tracker_state alive which holds the arena?
```cpp
namespace {

// Origin reservation per allocation; weak_ptr survives reset_tracker_state.
```
doesn't this just mean that we are calling reset_tracker_state before the work scheduled on a stream and thread is completed? It feels like we are missing some kind of synchronization before the reset is called. I don't think the solution is to fix it by trying to hold onto it longer.
Experimental — drops the explicit release() method + _logically_released atomic flag, returning device_reserved_arena to a dtor-only release path. Tracker reset_tracker_state callbacks correspondingly drop their release() invocations and rely on the shared_ptr lifecycle instead.

Parked here so PR NVIDIA#6 (sirius duckdb-native walker) can reset the sirius-recorded cucascade pin (0cd4a6a) and run its tests. To resume this branch:

```
cd cucascade && git checkout origin-tracked-allocation-attribution
```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## cuCascade origin-tracking redesign — proposal

### The bug

cuCascade's

### Setup for the timelines

### Timeline 1 — what we have today

Failure point: event 5.

### Timeline 2 — origin tracking with extended arena lifetime (current PR #114 stopgap)

cuCascade keeps a per-buffer origin tag at allocation. At dealloc, the tag is consulted to credit the alloc-origin arena (R_scan), not the dealloc-thread's TLS (R_pipe). To make this work when R_scan has already released but

Two problems with this shape:
The two designs below both close the failure in Timeline 1 without extending arena lifetime and without the race.

### Timeline 3 — orphan tier at release time

Same origin-tag mechanism at dealloc as Timeline 2. At release, in-flight bytes are moved to a separate visible counter (

Mental model: a reservation is operator workspace I will allocate against — incoming buffers from upstream stages don't consume it. R_pipe's per-arena counter is never touched by the misattribution; sirius reads R_pipe.size minus what R_pipe itself has allocated, and the value is always correct. This matches sirius's existing reservation-sizing logic —

[†] The arena C++ object dies at user-visible release, satisfying a strict reading of "arena does not live past its usage". The orphan counter does retain bytes attributed to the released reservation until late deallocs reconcile. If the objection is to the arena outliving release, this satisfies it; if the objection is to any bookkeeping outliving release, it does not — Timeline 4 below is the only design that satisfies that strict reading.

### Timeline 4 — explicit transfer at handoff time

Sirius declares an ownership transfer at every cross-thread handoff site via a new

Mental model: a reservation is all the memory this stage is responsible for at any moment — including buffers transferred in from upstream. R_pipe's per-arena counter reflects everything R_pipe owes the system, in a single number.

Cost: ~15 audited handoff sites in sirius today need a

### Empirical demonstration of point 1 (today's bug)

We added a deterministic Catch2 SCENARIO that reproduces the over-allocation directly. Setup mirrors the timeline above (numbers scaled down for test simplicity):

```cpp
SCENARIO("PER_THREAD: −20 KB mis-credit, then 84 KB alloc on a 96 KB-capacity system",
         "[memory_space][bug-repro][user-scenario]")
{
  // capacity 96 KB, R_pipe = 64 KB, R_scan = 32 KB, buffer = 20 KB
  auto res_pipe = manager->request_reservation({Tier::GPU, 0}, 64 * 1024);
  auto res_scan = manager->request_reservation({Tier::GPU, 0}, 32 * 1024);
  auto* mr = res_pipe->get_memory_resource_as<reservation_aware_resource_adaptor>();
  // T_scan attaches R_scan, allocates 20 KB, hands buffer to T_pipe.
  // T_pipe attaches R_pipe, deallocates the 20 KB buffer.
  // → R_pipe.alloc = −20 KB
  // Now T_pipe asks for 84 KB on R_pipe (R_pipe.size is only 64 KB):
  void* over_buffer = mr->allocate(stream, 84 * 1024, alignof(std::max_align_t));
  REQUIRE(over_buffer != nullptr);                  // ✓ approved
  REQUIRE(cudaMemsetAsync(over_buffer, 0xDD,
                          84 * 1024, stream) == cudaSuccess);  // ✓ real GPU memory
}
```

Observed: the 84 KB allocation succeeds, returns a valid GPU pointer, and the memory is writable. R_pipe's declared 64 KB cap is silently exceeded by 20 KB: 84 KB in the pipe reservation plus the 20 KB scan hand-off totals 104 KB, which will fail silently on a 96 KB capacity.

### In short

### What both solutions share, what they differ on

Both Timeline 3 and Timeline 4 close Timeline 1's failure (R_pipe.alloc never goes negative, so no phantom headroom). Both eliminate Timeline 2's lifetime extension and race. They differ on:
The choice depends on which mental model cuCascade wants its reservation API to commit to. Sirius today maps to Timeline 3. Timeline 4 is more accurate per-stage accounting, at the cost of consumer-side participation forever.

### More testing and code check

To actually stress our system, I over-allocate cuCascade capacity to force a bad alloc, because the original actually let you allocate memory outside of cuCascade's capacity.
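Timeline 4's handoff can be sketched roughly as follows; the `reservation` struct and `transfer_allocation` name are hypothetical stand-ins (the actual API name is elided in the proposal above), using the 64/32/20 KB numbers from the scenario:

```cpp
#include <cstdint>

// Hypothetical per-reservation accounting: size is the declared cap,
// allocated_bytes is what the reservation currently owes the system.
struct reservation {
  std::int64_t size;
  std::int64_t allocated_bytes = 0;
};

// Move a buffer's accounting from the source reservation to the destination
// at the cross-thread handoff site, so the later dealloc on the consumer side
// debits the reservation that now owns the bytes.
bool transfer_allocation(reservation& from, reservation& to, std::int64_t bytes) {
  if (to.allocated_bytes + bytes > to.size) return false;  // dest must have headroom
  from.allocated_bytes -= bytes;
  to.allocated_bytes += bytes;
  return true;
}
```

Under this model R_scan's counter drops to zero at handoff and R_pipe's cap is enforced against the transferred bytes, so the 84 KB follow-up allocation in the scenario would be rejected rather than silently approved.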
A separate validation: running the test against the existing PR #114 fix branch (
Bug
`reservation_aware_resource_adaptor` debits every dealloc against the calling thread's (PER_THREAD) or stream's (PER_STREAM) currently-attached reservation. When a buffer outlives that context — e.g., a producer thread allocates and a consumer thread destroys — the wrong reservation gets debited, its `allocated_bytes` goes negative, and the next `do_release_reservation` over-drains the global counter (`arena_size - negative` flips into addition). After enough releases the unsigned `_total_allocated_bytes` wraps and every subsequent reservation request fails.

Fix
Record the originating reservation per allocation in a sharded `ptr -> weak_ptr<arena>` map (16 shards). At dealloc, debit the origin reservation; if it's already released, fall through to the symmetric global-counter debit. `device_reserved_arena` gains an idempotent `release()` so the global counter still updates at logical release time even when outstanding allocations keep the arena physically alive via `weak_ptr.lock()`. `assign_reservation_to_tracker` switches to `shared_ptr<arena>` to support this; the public `attach_reservation_to_tracker` API is unchanged.

Tests
Rewrote `multi-reservation memory_resource mismatch` (which asserted the legacy buggy behaviour) as `multi-reservation cross-stream dealloc preserves origin attribution`. Added `PER_THREAD mode: cross-thread dealloc preserves origin attribution`, covering the cross-thread path that motivated the fix; it fails under pre-fix code with the expected `+alloc_size` skew and passes under the fix.

Some notes
`get_tracker_state` might have a pre-existing race with `reset_tracker_state` (raw-pointer return).