B&B with Local Heaps #1149
Conversation
Remove dependency on rmm::mr::device_memory_resource base class. Resources now satisfy the cuda::mr::resource concept directly.
- Replace shared_ptr<device_memory_resource> with value types and cuda::mr::any_resource<cuda::mr::device_accessible> for type-erased storage
- Replace set_current_device_resource(ptr) with set_current_device_resource_ref
- Replace set_per_device_resource(id, ptr) with set_per_device_resource_ref
- Remove make_owning_wrapper usage
- Remove dynamic_cast on memory resources (no common base class)
- Remove owning_wrapper.hpp and device_memory_resource.hpp includes
- Add missing thrust/iterator/transform_output_iterator.h include (no longer transitively included via CCCL)
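For illustration, the new pattern might look like the following sketch, assuming a recent RMM release that provides `set_current_device_resource_ref` (this is not code from the PR; names and the pool size are illustrative):

```cpp
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

void configure_memory()
{
  // Before: shared_ptr<device_memory_resource> plus make_owning_wrapper.
  // After: plain value types that satisfy cuda::mr::resource directly.
  static rmm::mr::cuda_memory_resource upstream;
  static rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool{
    &upstream, /*initial_pool_size=*/0};

  // Replaces set_current_device_resource(&pool); this takes a resource_ref,
  // so no common base class or dynamic_cast is needed.
  rmm::mr::set_current_device_resource_ref(pool);
}
```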
…nd deterministic mode. Signed-off-by: Nicolas Guidotti <224634272+nguidotti@users.noreply.github.com>
… shared_ptr to avoid unnecessary copy. Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…l crash in work-stealing Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…queue for now. refactoring. Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
… are present Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
# Conflicts: # cpp/src/utilities/cuda_helpers.cuh
# Conflicts: # ci/validate_wheel.sh
Actionable comments posted: 2
🧹 Nitpick comments (1)
cpp/src/branch_and_bound/branch_and_bound.cpp (1)
1687-1692: Decrement `nodes_unexplored` when a queued node is cut off.
This path prunes a node that was previously counted as unexplored, but the counter is never decremented here. That leaves the solver stats inconsistent and can skew the final summary.
Suggested fix
```diff
 if (upper_bound_.load() < start_node->lower_bound) {
   // This node was put on the heap earlier but its lower bound is now greater than the
   // current upper bound
   search_tree_.graphviz_node(settings_.log, start_node, "cutoff", start_node->lower_bound);
   search_tree_.update(start_node, node_status_t::FATHOMED);
+  --exploration_stats_.nodes_unexplored;
   continue;
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/branch_and_bound/branch_and_bound.cpp` around lines 1687 - 1692, The cutoff branch that checks if upper_bound_.load() < start_node->lower_bound currently marks the node as FATHOMED but doesn't decrement the unexplored counter; update that if block to decrement nodes_unexplored when a queued node is pruned (i.e., right after search_tree_.update(start_node, node_status_t::FATHOMED)), using the appropriate operation for the counter's type (nodes_unexplored-- or nodes_unexplored.fetch_sub(1) if it's atomic) so solver stats remain consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/branch_and_bound/branch_and_bound.cpp`:
- Around line 1717-1718: best_first_search_with() returns the bfs worker to
bfs_worker_pool_ immediately while active diving workers still call
worker->set_inactive() and decrement total_active_diving_workers, causing reuse
races; either defer calling bfs_worker_pool_.return_worker_to_pool(worker) until
that worker's total_active_diving_workers reaches zero (check the worker's
total_active_diving_workers/active_diving_worker_count before returning) or
implement a generation/tag on the BFS worker that diving workers validate before
mutating counters (increment generation on checkout and have set_inactive()
ignore updates if generation mismatches), and add the missing synchronization
(mutex/atomic) around shared counters (total_active_diving_workers) to prevent
races.
In `@cpp/src/branch_and_bound/worker.hpp`:
- Around line 163-189: After successfully stealing a node in
bfs_worker_t::steal_node_from (i.e., after this->node_queue.push(node)), update
the thief worker’s advertised lower_bound to reflect the stolen node so
worker_pool_t::get_lower_bound() sees it; specifically, while holding the
thief’s node_queue lock (the same lock used to push), set this->lower_bound =
std::min(this->lower_bound, <stolen node’s lower-bound field>) or otherwise
assign the node’s lower-bound value if it is smaller, ensuring thread-safety and
correctness of the global lower-bound computation.
---
Nitpick comments:
In `@cpp/src/branch_and_bound/branch_and_bound.cpp`:
- Around line 1687-1692: The cutoff branch that checks if upper_bound_.load() <
start_node->lower_bound currently marks the node as FATHOMED but doesn't
decrement the unexplored counter; update that if block to decrement
nodes_unexplored when a queued node is pruned (i.e., right after
search_tree_.update(start_node, node_status_t::FATHOMED)), using the appropriate
operation for the counter's type (nodes_unexplored-- or
nodes_unexplored.fetch_sub(1) if it's atomic) so solver stats remain consistent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 753ef90d-0840-45d4-aeee-cfb63a2093fa
📒 Files selected for processing (5)
- cpp/src/branch_and_bound/branch_and_bound.cpp
- cpp/src/branch_and_bound/branch_and_bound.hpp
- cpp/src/branch_and_bound/mip_node.hpp
- cpp/src/branch_and_bound/worker.hpp
- cpp/src/branch_and_bound/worker_pool.hpp
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/src/branch_and_bound/mip_node.hpp
```cpp
worker->set_inactive();
bfs_worker_pool_.return_worker_to_pool(worker);
```
Don’t recycle a BFS worker while its diving workers still reference it.
best_first_search_with() returns the worker to bfs_worker_pool_ immediately, but cpp/src/branch_and_bound/worker.hpp:271-281 shows each active diving worker still decrements counters on that same bfs_worker in set_inactive(). If this object gets reused first, old dives will mutate the new assignment’s accounting and can underflow/assert on completion.
Hold the BFS worker out of the idle pool until total_active_diving_workers reaches zero, or make the diving-worker accounting generation-aware before reuse.
As per coding guidelines, flag missing synchronization for shared state in concurrent code.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/branch_and_bound/branch_and_bound.cpp` around lines 1717 - 1718,
best_first_search_with() returns the bfs worker to bfs_worker_pool_ immediately
while active diving workers still call worker->set_inactive() and decrement
total_active_diving_workers, causing reuse races; either defer calling
bfs_worker_pool_.return_worker_to_pool(worker) until that worker's
total_active_diving_workers reaches zero (check the worker's
total_active_diving_workers/active_diving_worker_count before returning) or
implement a generation/tag on the BFS worker that diving workers validate before
mutating counters (increment generation on checkout and have set_inactive()
ignore updates if generation mismatches), and add the missing synchronization
(mutex/atomic) around shared counters (total_active_diving_workers) to prevent
races.
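The generation-tag variant mentioned in the prompt above can be sketched in a few lines. This is a minimal illustration with invented members (`generation`, `checkout()`), not the actual `bfs_worker_t` from `worker.hpp`:

```cpp
#include <atomic>
#include <cassert>

// Sketch: a diving worker captures the BFS worker's generation at checkout,
// and set_inactive() ignores stale reports made after the worker was recycled.
struct bfs_worker_t {
  std::atomic<unsigned> generation{0};
  std::atomic<int> total_active_diving_workers{0};

  // Called when the worker is handed out from the pool; bumps the generation
  // so any dive still holding the old tag becomes stale.
  unsigned checkout() { return generation.fetch_add(1) + 1; }

  // A diving worker reports completion with the generation it observed.
  // Returns false (and leaves counters untouched) if the worker was reused.
  bool set_inactive(unsigned observed_gen) {
    if (observed_gen != generation.load()) return false;  // stale dive, ignore
    total_active_diving_workers.fetch_sub(1);
    return true;
  }
};
```

The alternative the review suggests (holding the worker out of the pool until `total_active_diving_workers` reaches zero) avoids the tag entirely at the cost of delaying reuse.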
…y space Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
/ok to test 2cd842a
/ok to test 56c8cff
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/src/branch_and_bound/pseudo_costs.cpp (1)
1350-1392: ⚠️ Potential issue | 🟠 Major
Initialize `averages` unconditionally to ensure correct fallback scoring.
`averages` is only assigned when `reliable_threshold == 0`, but `calculate_pseudocost_score(..., averages)` is called at lines 1403, 1519, 1693, and 1773 regardless of the threshold value. When `reliable_threshold > 0`, uninitialized variables with `num_up == 0` or `num_down == 0` fall back to `averages.up_avg` and `averages.down_avg`, which remain at their zero-initialized defaults (0.0) instead of the computed averages. This produces incorrect branch scores.
Suggested fix
```diff
- pseudo_cost_averages_t<i_t, f_t> averages;
+ pseudo_cost_averages_t<i_t, f_t> averages = compute_averages();
  ...
- if (reliable_threshold == 0) {
-   averages = compute_averages();
+ if (reliable_threshold == 0) {
    log.printf("PC: num initialized down %d up %d avg down %e up %e\n",
               averages.num_init_down, averages.num_init_up,
               averages.down_avg, averages.up_avg);
```
Also applies to: 1693, 1773
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/branch_and_bound/pseudo_costs.cpp` around lines 1350 - 1392, The variable averages must be initialized unconditionally to avoid zero-default fallbacks in calculate_pseudocost_score; call averages = compute_averages() before computing reliable_threshold (or at least before any calls to calculate_pseudocost_score) so averages.down_avg/up_avg and num_init_* are valid when calculate_pseudocost_score is invoked (refer to pseudo_cost_averages_t and compute_averages, and the calculate_pseudocost_score calls at lines around 1403, 1519, 1693, 1773). Keep the existing log.printf inside the reliable_threshold == 0 branch (or guard it) but ensure averages is assigned regardless of the threshold.
♻️ Duplicate comments (2)
cpp/src/branch_and_bound/branch_and_bound.cpp (1)
1718-1719: ⚠️ Potential issue | 🔴 Critical
Don’t recycle the BFS worker while diving tasks still hold its back-pointer.
`launch_diving_worker()` stores `diving_worker->bfs_worker = bfs_worker` and increments that BFS worker’s diving counters, but this function returns the BFS worker to the idle pool immediately after the best-first loop exits. If any dive is still running, it can report back into a recycled worker object and corrupt the new assignment’s accounting. Please keep the BFS worker out of the idle pool until `total_active_diving_workers` reaches zero, or make the dive completion path generation-aware. As per coding guidelines, "Flag missing synchronization for shared state in concurrent code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/branch_and_bound/branch_and_bound.cpp` around lines 1718 - 1719, The BFS worker is returned to the idle pool while active diving tasks still hold pointers to it (launch_diving_worker sets diving_worker->bfs_worker and increments counters), risking use-after-recycle; modify the lifecycle so the BFS worker is not returned via bfs_worker_pool_.return_worker_to_pool(worker) (and not marked inactive via worker->set_inactive()) until total_active_diving_workers reaches zero, or alternatively change the diving completion path to validate/generate-awarely clear diving_worker->bfs_worker and decrement the BFS worker’s diving counters before any worker object can be returned to the pool; update launch_diving_worker, the dive completion callback, and the code around worker->set_inactive()/bfs_worker_pool_.return_worker_to_pool to enforce this ordering and synchronization.
cpp/src/branch_and_bound/worker.hpp (1)
184-186: ⚠️ Potential issue | 🟠 Major
Refresh `lower_bound` when a node is stolen.
After `this->node_queue.push(node)`, this worker still advertises its old `lower_bound`. If global gap checks aggregate per-worker bounds, a cheaper stolen node stays invisible until this worker starts processing it.
Suggested fix
```diff
  this->node_queue.lock();
  this->node_queue.push(node);
+ this->lower_bound = std::min(this->lower_bound.load(), node->lower_bound);
  this->node_queue.unlock();
```
As per coding guidelines, "Flag algorithm correctness errors including logic errors in branch-and-bound implementations."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/branch_and_bound/worker.hpp` around lines 184 - 186, After pushing a stolen node into this->node_queue (at the this->node_queue.push(node) site) the worker still advertises an out-of-date this->lower_bound; update it immediately so global gap checks see the cheaper bound. Fix: after pushing (and/or after releasing the node_queue lock) recompute and publish the local lower bound by invoking the same routine used elsewhere to derive the worker bound (e.g., call the existing update_lower_bound()/recompute_lower_bound_from_queue() helper or inline the same computation that sets this->lower_bound), ensuring any required synchronization or atomics are used when writing this->lower_bound.
🧹 Nitpick comments (1)
cpp/src/branch_and_bound/pseudo_costs.cpp (1)
750-759: Document the warm-start PDLP tolerance choice.
`pdlp_tolerance = 1e-5` drives all six stopping tolerances here, but the code does not explain why this is the right accuracy/cost tradeoff for strong branching warm starts. A brief rationale or a named setting would make later tuning much safer. As per coding guidelines, "Flag missing documentation for numerical tolerances and algorithm parameters."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/branch_and_bound/pseudo_costs.cpp` around lines 750 - 759, Add a short comment and/or named constant explaining why pdlp_tolerance = 1e-5 is chosen for warm-start PDLP and that it trades accuracy for speed for strong-branching warm starts: document that this tolerance is relaxed/tightened to balance warm-start cost vs branching quality, mention any empirical basis (e.g., "empirically found to preserve branching decisions while reducing solve time") and the related setting name pdlp_tolerance which is applied to ws_settings.tolerances (relative/absolute dual/primal/gap); alternatively pull the literal into a clearly named constant (e.g., warm_start_pdlp_tolerance) with the comment so future tuning is safer and traceable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/branch_and_bound/branch_and_bound.cpp`:
- Around line 3838-3850: Before calling dual_phase2_with_advanced_basis (and
similarly solve_linear_program_with_advanced_basis) for a deterministic dive,
restore the node’s packed basis into the worker’s vstatus: copy
node_ptr->packed_vstatus into worker.leaf_vstatus (or call the helper that
rebuilds vstatus from packed_vstatus) so the advanced basis matches the current
node; do this prior to using worker.leaf_vstatus in the LP call (references:
worker.leaf_vstatus, node_ptr->packed_vstatus, dual_phase2_with_advanced_basis,
solve_linear_program_with_advanced_basis).
In `@cpp/src/branch_and_bound/pseudo_costs.cpp`:
- Around line 784-789: The async device_uvector copies into
cache->initial_primal/initial_dual are performed on the warm-start stream
(ws_primal.stream()) but cache->populated is set immediately, causing races when
read later by batch_pdlp_reliability_branching_task using a different
batch_pdlp_handle stream; fix by synchronizing the source and consumer streams:
after creating the device_uvector copies (the code that sets
cache->initial_primal and cache->initial_dual from ws_primal/ws_dual and before
setting cache->populated) record an event on the warm-start stream (or call
stream.synchronize()) and make the batch_pdlp_handle stream wait on that event
(or synchronize) so readers in batch_pdlp_reliability_branching_task only see
populated=true after the copies are complete; apply the same pattern for both
warm-start places that set cache->populated and for the corresponding reader
path in batch_pdlp_reliability_branching_task.
In `@cpp/src/dual_simplex/initial_basis.cpp`:
- Around line 21-45: The current 2-bit packing in encode()/decode() is lossy (5
states need representation); fix by widening the packed representation to 3
bits: in encode(variable_status_t) assign unique 3-bit codes for BASIC,
NONBASIC_LOWER, NONBASIC_UPPER, NONBASIC_FIXED and NONBASIC_FREE (e.g. 0..4),
and in decode(uint8_t) mask with 0b111 and map those codes back to the exact
enum values; update any assertions to still reject SUPERBASIC and ensure both
functions use the same 3-bit scheme so packing/unpacking is lossless.
- Around line 76-85: The assert in decompress_vstatus is checking the wrong
value: it compares vstatus_size (number of unpacked statuses) against n +
has_tail (expected number of packed bytes). Update the assertion to validate the
input packed size by asserting that packed_vstatus.size() == n + has_tail so the
function verifies the number of packed bytes matches the computed count; keep
the rest of decompress_vstatus (variables n, has_tail, vstatus) unchanged.
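The 3-bit encode/decode described for `initial_basis.cpp` above could look like this sketch. The enum names follow the review text, but the numeric values (0..4) and the layout are hypothetical, not the actual cuOpt definitions:

```cpp
#include <cassert>
#include <cstdint>

enum class variable_status_t : uint8_t {
  BASIC          = 0,
  NONBASIC_LOWER = 1,
  NONBASIC_UPPER = 2,
  NONBASIC_FIXED = 3,
  NONBASIC_FREE  = 4,
  SUPERBASIC     = 5,  // must never appear in a packed basis
};

inline uint8_t encode(variable_status_t s)
{
  assert(s != variable_status_t::SUPERBASIC);  // still rejected, as before
  return static_cast<uint8_t>(s) & 0b111;      // 3 bits: all 5 states fit losslessly
}

inline variable_status_t decode(uint8_t bits)
{
  // Same 3-bit scheme as encode(), so pack/unpack round-trips exactly.
  return static_cast<variable_status_t>(bits & 0b111);
}
```

With 2 bits only four codes exist, so one of the five non-superbasic states was necessarily aliased; widening to 3 bits removes the collision.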
---
Outside diff comments:
In `@cpp/src/branch_and_bound/pseudo_costs.cpp`:
- Around line 1350-1392: The variable averages must be initialized
unconditionally to avoid zero-default fallbacks in calculate_pseudocost_score;
call averages = compute_averages() before computing reliable_threshold (or at
least before any calls to calculate_pseudocost_score) so
averages.down_avg/up_avg and num_init_* are valid when
calculate_pseudocost_score is invoked (refer to pseudo_cost_averages_t and
compute_averages, and the calculate_pseudocost_score calls at lines around 1403,
1519, 1693, 1773). Keep the existing log.printf inside the reliable_threshold ==
0 branch (or guard it) but ensure averages is assigned regardless of the
threshold.
---
Duplicate comments:
In `@cpp/src/branch_and_bound/branch_and_bound.cpp`:
- Around line 1718-1719: The BFS worker is returned to the idle pool while
active diving tasks still hold pointers to it (launch_diving_worker sets
diving_worker->bfs_worker and increments counters), risking use-after-recycle;
modify the lifecycle so the BFS worker is not returned via
bfs_worker_pool_.return_worker_to_pool(worker) (and not marked inactive via
worker->set_inactive()) until total_active_diving_workers reaches zero, or
alternatively change the diving completion path to validate/generate-awarely
clear diving_worker->bfs_worker and decrement the BFS worker’s diving counters
before any worker object can be returned to the pool; update
launch_diving_worker, the dive completion callback, and the code around
worker->set_inactive()/bfs_worker_pool_.return_worker_to_pool to enforce this
ordering and synchronization.
In `@cpp/src/branch_and_bound/worker.hpp`:
- Around line 184-186: After pushing a stolen node into this->node_queue (at the
this->node_queue.push(node) site) the worker still advertises an out-of-date
this->lower_bound; update it immediately so global gap checks see the cheaper
bound. Fix: after pushing (and/or after releasing the node_queue lock) recompute
and publish the local lower bound by invoking the same routine used elsewhere to
derive the worker bound (e.g., call the existing
update_lower_bound()/recompute_lower_bound_from_queue() helper or inline the
same computation that sets this->lower_bound), ensuring any required
synchronization or atomics are used when writing this->lower_bound.
---
Nitpick comments:
In `@cpp/src/branch_and_bound/pseudo_costs.cpp`:
- Around line 750-759: Add a short comment and/or named constant explaining why
pdlp_tolerance = 1e-5 is chosen for warm-start PDLP and that it trades accuracy
for speed for strong-branching warm starts: document that this tolerance is
relaxed/tightened to balance warm-start cost vs branching quality, mention any
empirical basis (e.g., "empirically found to preserve branching decisions while
reducing solve time") and the related setting name pdlp_tolerance which is
applied to ws_settings.tolerances (relative/absolute dual/primal/gap);
alternatively pull the literal into a clearly named constant (e.g.,
warm_start_pdlp_tolerance) with the comment so future tuning is safer and
traceable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 753b46ac-59b0-4cf5-b3eb-a6f17a6e04e6
📒 Files selected for processing (10)
- cpp/src/branch_and_bound/branch_and_bound.cpp
- cpp/src/branch_and_bound/branch_and_bound.hpp
- cpp/src/branch_and_bound/mip_node.hpp
- cpp/src/branch_and_bound/node_queue.hpp
- cpp/src/branch_and_bound/pseudo_costs.cpp
- cpp/src/branch_and_bound/pseudo_costs.hpp
- cpp/src/branch_and_bound/worker.hpp
- cpp/src/dual_simplex/initial_basis.cpp
- cpp/src/dual_simplex/initial_basis.hpp
- cpp/src/utilities/omp_helpers.hpp
✅ Files skipped from review due to trivial changes (1)
- cpp/src/dual_simplex/initial_basis.hpp
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/src/utilities/omp_helpers.hpp
```cpp
cache->initial_primal = rmm::device_uvector<f_t>(ws_primal, ws_primal.stream());
cache->initial_dual   = rmm::device_uvector<f_t>(ws_dual, ws_dual.stream());
cache->step_size      = ws_solution.get_pdlp_warm_start_data().initial_step_size_;
cache->primal_weight  = ws_solution.get_pdlp_warm_start_data().initial_primal_weight_;
cache->pdlp_iteration = ws_solution.get_pdlp_warm_start_data().total_pdlp_iterations_;
cache->populated      = true;
```
Synchronize cache visibility before reads from different stream contexts.
Cache vectors are copied asynchronously on the warm-start handle's stream (lines 784-789), but the populated flag is set immediately without synchronization. Later, batch_pdlp_reliability_branching_task reads the cache through a different batch_pdlp_handle (lines 943-952) with no ordering guarantee. Since the handle's streams are independent, the async copy may not complete before the read, causing races. This applies to both code locations (lines 784-789 and 943-952).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/branch_and_bound/pseudo_costs.cpp` around lines 784 - 789, The async
device_uvector copies into cache->initial_primal/initial_dual are performed on
the warm-start stream (ws_primal.stream()) but cache->populated is set
immediately, causing races when read later by
batch_pdlp_reliability_branching_task using a different batch_pdlp_handle
stream; fix by synchronizing the source and consumer streams: after creating the
device_uvector copies (the code that sets cache->initial_primal and
cache->initial_dual from ws_primal/ws_dual and before setting cache->populated)
record an event on the warm-start stream (or call stream.synchronize()) and make
the batch_pdlp_handle stream wait on that event (or synchronize) so readers in
batch_pdlp_reliability_branching_task only see populated=true after the copies
are complete; apply the same pattern for both warm-start places that set
cache->populated and for the corresponding reader path in
batch_pdlp_reliability_branching_task.
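The event-based ordering the prompt asks for could be sketched as follows. The stream parameters and the surrounding cache fields are placeholders, and error checking is elided; this is an illustration of the CUDA event pattern, not the actual fix:

```cpp
#include <cuda_runtime.h>

void order_cache_publication(cudaStream_t ws_stream, cudaStream_t consumer_stream)
{
  cudaEvent_t copies_done;
  cudaEventCreateWithFlags(&copies_done, cudaEventDisableTiming);

  // Producer: record after enqueueing the async device_uvector copies that
  // fill cache->initial_primal / cache->initial_dual on the warm-start stream.
  cudaEventRecord(copies_done, ws_stream);

  // Consumer: the batch_pdlp_handle stream must not read the cache until the
  // copies have completed on the producer stream.
  cudaStreamWaitEvent(consumer_stream, copies_done, 0);

  // Only once this ordering is established is it safe for readers on
  // consumer_stream to act on populated == true.
}
```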
/ok to test 54d3f66
# Conflicts: # cpp/src/branch_and_bound/branch_and_bound.cpp # cpp/src/branch_and_bound/constants.hpp # cpp/src/branch_and_bound/pseudo_costs.hpp # cpp/src/utilities/omp_helpers.hpp
/ok to test 2adec89
/ok to test bc3d50b
In this PR, each best-first worker has its own local node heap, so it can push and pop nodes without synchronizing with the other workers. Each best-first worker periodically steals a node from a random worker to keep the node distribution more or less balanced across them. Additionally, each best-first worker has a fixed set of diving workers assigned to it, which perform dives on its own nodes whenever possible. This essentially eliminates the need for the scheduler thread, freeing one more thread to do useful work.
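As a rough illustration of the scheme, each worker owns a small locked min-heap that a thief can also pop from. This is a toy sketch with invented names, not the PR's actual `worker.hpp` code:

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <optional>
#include <queue>
#include <vector>

// Per-worker local heap: the owner does best-first push/pop on it without
// coordinating with other workers, except for the per-heap mutex that also
// lets an occasional thief remove one node to rebalance the distribution.
struct local_heap_t {
  std::mutex m;
  // min-heap on lower bound, so pop() yields the most promising node
  std::priority_queue<double, std::vector<double>, std::greater<double>> heap;

  void push(double lower_bound)
  {
    std::lock_guard<std::mutex> g(m);
    heap.push(lower_bound);
  }

  std::optional<double> pop()
  {
    std::lock_guard<std::mutex> g(m);
    if (heap.empty()) return std::nullopt;
    double lb = heap.top();
    heap.pop();
    return lb;
  }

  // A thief takes the victim's current best node; in the real code the
  // stolen node moves into the thief's own heap.
  std::optional<double> steal() { return pop(); }
};
```

In the actual implementation the heap holds node objects rather than bare bounds, and stealing picks a random victim; the sketch only shows the locking granularity that makes the scheduler thread unnecessary.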
Includes #1020.
MIPLIB results (GH200, 10min):
In summary, we explored ~3× more nodes on average in the same time frame. The number of optimal solutions also increased by 6.

Checklist