
B&B with Local Heaps#1149

Open
nguidotti wants to merge 71 commits into NVIDIA:main from nguidotti:bnb-local-heap

Conversation

@nguidotti (Contributor) commented Apr 27, 2026

In this PR, each best-first worker has its own local node heap, so it can push and pop nodes without synchronizing with the other workers. Each best-first worker periodically steals a node from a random worker to keep the node distribution roughly balanced across workers. Additionally, each best-first worker has a fixed set of diving workers assigned to it, which perform diving on its nodes whenever possible. This essentially eliminates the need for the scheduler thread, freeing one additional thread to do useful work.

Includes #1020.
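
The per-worker design described above can be sketched roughly as follows. This is a minimal single-file illustration, not the actual cuOpt code: `node_t`, `worker_t`, and `steal_one` are invented names for the example, and a real implementation would order branch-and-bound nodes by lower bound and trigger stealing on a schedule.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>
#include <random>
#include <vector>

// Hypothetical node: a smaller lower_bound is more promising (min-heap order).
struct node_t {
  double lower_bound;
};

struct node_cmp {
  bool operator()(const node_t& a, const node_t& b) const {
    return a.lower_bound > b.lower_bound;  // min-heap on lower_bound
  }
};

// Each worker owns a heap guarded by its own mutex: local push/pop never
// contends with other workers unless a thief happens to grab the lock.
struct worker_t {
  std::mutex mtx;
  std::priority_queue<node_t, std::vector<node_t>, node_cmp> heap;

  void push(node_t n) {
    std::lock_guard<std::mutex> lk(mtx);
    heap.push(n);
  }

  bool try_pop(node_t& out) {
    std::lock_guard<std::mutex> lk(mtx);
    if (heap.empty()) return false;
    out = heap.top();
    heap.pop();
    return true;
  }
};

// Called periodically by a worker whose heap ran low: pick a random victim
// and move one node over to keep the distribution roughly balanced. Here the
// thief takes the victim's current best node; the real policy may differ.
inline bool steal_one(std::vector<worker_t>& workers, std::size_t thief_id, std::mt19937& rng) {
  std::uniform_int_distribution<std::size_t> pick(0, workers.size() - 1);
  std::size_t victim_id = pick(rng);
  if (victim_id == thief_id) return false;
  node_t stolen;
  if (!workers[victim_id].try_pop(stolen)) return false;
  workers[thief_id].push(stolen);
  return true;
}
```

Because each heap has its own lock, contention only occurs on the (rare) steal path, which is what makes the dedicated scheduler thread unnecessary.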

MIPLIB results (GH200, 10min):

================================================================================
 main-100426-1 (1) vs bnb-local-heap (2)
================================================================================

------------------------------------------------------------------------------------------------------------------------------
|                                        |       Run 1        |       Run 2        |     Abs. Diff.     |   Rel. Diff. (%)   |
------------------------------------------------------------------------------------------------------------------------------
| Feasible                                                 227                  226                   -1                 --- |
| Optimal                                                   71                   77                   +6                 --- |
| Solutions with <0.1% primal gap                          124                  130                   +6                 --- |
| Nodes explored (mean)                              4.484e+06            1.316e+07           +8.675e+06                +193 |
| Nodes explored (shifted geomean)                        7395            1.202e+04                +4620               +62.5 |
| Relative MIP gap (mean)                               0.3361               0.3492             +0.01306               +3.89 |
| Relative MIP gap (shifted geomean)                     0.116               0.1146            -0.001384               -1.19 |
| Solve time (mean)                                      449.2                436.4                -12.8               -2.85 |
| Solve time (shifted geomean)                           231.9                212.8               -19.03               -8.21 |
| Primal gap (mean)                                      11.26                11.77              +0.5112               +4.54 |
| Primal gap (shifted geomean)                           0.602               0.5709             -0.03103               -5.15 |
| Primal integral (mean)                                 33.85                34.99               +1.142               +3.37 |
| Primal integral (shifted geomean)                      6.655                 6.89              +0.2345               +3.52 |
------------------------------------------------------------------------------------------------------------------------------

In summary, we explored ~3x more nodes on average in the same time frame. The number of optimal solutions also increased by 6.

Checklist

  • I am familiar with the Contributing Guidelines.
  • Testing
    • New or existing tests cover these changes
    • Added tests
    • Created an issue to follow-up
    • NA
  • Documentation
    • The documentation is up to date with these changes
    • Added new documentation
    • NA

bdice and others added 30 commits April 3, 2026 13:51
Remove dependency on rmm::mr::device_memory_resource base class. Resources
now satisfy the cuda::mr::resource concept directly.

- Replace shared_ptr<device_memory_resource> with value types and
  cuda::mr::any_resource<cuda::mr::device_accessible> for type-erased storage
- Replace set_current_device_resource(ptr) with set_current_device_resource_ref
- Replace set_per_device_resource(id, ptr) with set_per_device_resource_ref
- Remove make_owning_wrapper usage
- Remove dynamic_cast on memory resources (no common base class)
- Remove owning_wrapper.hpp and device_memory_resource.hpp includes
- Add missing thrust/iterator/transform_output_iterator.h include
  (no longer transitively included via CCCL)
…nd deterministic mode.

Signed-off-by: Nicolas Guidotti <224634272+nguidotti@users.noreply.github.com>
Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>
… shared_ptr to avoid unnecessary copy.

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…l crash in work-stealing

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…queue for now. refactoring.

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
… are present

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
# Conflicts:
#	cpp/src/utilities/cuda_helpers.cuh
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
# Conflicts:
#	ci/validate_wheel.sh
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
@nguidotti nguidotti requested review from chris-maes and hlinsen April 28, 2026 08:17
@coderabbitai (bot) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
cpp/src/branch_and_bound/branch_and_bound.cpp (1)

1687-1692: Decrement nodes_unexplored when a queued node is cut off.

This path prunes a node that was previously counted as unexplored, but the counter is never decremented here. That leaves the solver stats inconsistent and can skew the final summary.

Suggested fix
     if (upper_bound_.load() < start_node->lower_bound) {
       // This node was put on the heap earlier but its lower bound is now greater than the
       // current upper bound
       search_tree_.graphviz_node(settings_.log, start_node, "cutoff", start_node->lower_bound);
       search_tree_.update(start_node, node_status_t::FATHOMED);
+      --exploration_stats_.nodes_unexplored;
       continue;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/branch_and_bound/branch_and_bound.cpp` around lines 1687 - 1692, The
cutoff branch that checks if upper_bound_.load() < start_node->lower_bound
currently marks the node as FATHOMED but doesn't decrement the unexplored
counter; update that if block to decrement nodes_unexplored when a queued node
is pruned (i.e., right after search_tree_.update(start_node,
node_status_t::FATHOMED)), using the appropriate operation for the counter's
type (nodes_unexplored-- or nodes_unexplored.fetch_sub(1) if it's atomic) so
solver stats remain consistent.


📥 Commits

Reviewing files that changed from the base of the PR and between d936e05 and a67f644.

📒 Files selected for processing (5)
  • cpp/src/branch_and_bound/branch_and_bound.cpp
  • cpp/src/branch_and_bound/branch_and_bound.hpp
  • cpp/src/branch_and_bound/mip_node.hpp
  • cpp/src/branch_and_bound/worker.hpp
  • cpp/src/branch_and_bound/worker_pool.hpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/src/branch_and_bound/mip_node.hpp

Comment on lines +1717 to +1718
worker->set_inactive();
bfs_worker_pool_.return_worker_to_pool(worker);

⚠️ Potential issue | 🔴 Critical

Don’t recycle a BFS worker while its diving workers still reference it.

best_first_search_with() returns the worker to bfs_worker_pool_ immediately, but cpp/src/branch_and_bound/worker.hpp:271-281 shows each active diving worker still decrements counters on that same bfs_worker in set_inactive(). If this object gets reused first, old dives will mutate the new assignment’s accounting and can underflow/assert on completion.

Hold the BFS worker out of the idle pool until total_active_diving_workers reaches zero, or make the diving-worker accounting generation-aware before reuse.

As per coding guidelines, flag missing synchronization for shared state in concurrent code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/branch_and_bound/branch_and_bound.cpp` around lines 1717 - 1718,
best_first_search_with() returns the bfs worker to bfs_worker_pool_ immediately
while active diving workers still call worker->set_inactive() and decrement
total_active_diving_workers, causing reuse races; either defer calling
bfs_worker_pool_.return_worker_to_pool(worker) until that worker's
total_active_diving_workers reaches zero (check the worker's
total_active_diving_workers/active_diving_worker_count before returning) or
implement a generation/tag on the BFS worker that diving workers validate before
mutating counters (increment generation on checkout and have set_inactive()
ignore updates if generation mismatches), and add the missing synchronization
(mutex/atomic) around shared counters (total_active_diving_workers) to prevent
races.
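
The generation-tag alternative suggested in this comment could look roughly like the sketch below. All names are invented (this is not the PR's actual `bfs_worker_t`), and a real fix would need a stronger protocol around the check-then-decrement pair; the sketch only shows the core idea that a stale dive must not mutate a recycled worker's accounting.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical sketch: a diving worker snapshots the BFS worker's generation
// when it is launched and only reports back if the generation still matches,
// so a recycled worker object ignores stale completions.
struct bfs_worker_t {
  std::atomic<uint64_t> generation{0};
  std::atomic<int> active_diving_workers{0};

  // Called when the worker is checked out of the pool for a new assignment;
  // bumping the generation invalidates any dives from the previous owner.
  uint64_t checkout() {
    active_diving_workers.store(0);
    return generation.fetch_add(1) + 1;
  }

  void on_dive_started() { active_diving_workers.fetch_add(1); }

  // A dive finishing with a stale tag must not touch the new assignment.
  // Note: the load/decrement pair is not atomic as a whole; a production
  // version would serialize this with the recycle path.
  bool on_dive_finished(uint64_t tag) {
    if (tag != generation.load()) return false;  // stale: ignore
    active_diving_workers.fetch_sub(1);
    return true;
  }
};
```

The alternative the comment mentions, holding the worker out of the idle pool until `total_active_diving_workers` reaches zero, avoids the tag entirely at the cost of delaying reuse.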

Comment thread cpp/src/branch_and_bound/worker.hpp
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…y space

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
@nguidotti (Contributor, Author) commented:

/ok to test 2cd842a

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
@nguidotti (Contributor, Author) commented:

/ok to test 56c8cff

@nguidotti nguidotti requested review from aliceb-nv and removed request for hlinsen April 28, 2026 12:56
@coderabbitai (bot) left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/src/branch_and_bound/pseudo_costs.cpp (1)

1350-1392: ⚠️ Potential issue | 🟠 Major

Initialize averages unconditionally to ensure correct fallback scoring.

averages is only assigned when reliable_threshold == 0, but calculate_pseudocost_score(..., averages) is called at lines 1403, 1519, 1693, and 1773 regardless of the threshold value. When reliable_threshold > 0, uninitialized variables with num_up == 0 or num_down == 0 fall back to averages.up_avg and averages.down_avg, which remain at their zero-initialized defaults (0.0) instead of the computed averages. This produces incorrect branch scores.

Suggested fix
-  pseudo_cost_averages_t<i_t, f_t> averages;
+  pseudo_cost_averages_t<i_t, f_t> averages = compute_averages();
...
-  if (reliable_threshold == 0) {
-    averages = compute_averages();
+  if (reliable_threshold == 0) {
     log.printf("PC: num initialized down %d up %d avg down %e up %e\n",
                averages.num_init_down,
                averages.num_init_up,
                averages.down_avg,
                averages.up_avg);

Also applies to: 1693, 1773

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/branch_and_bound/pseudo_costs.cpp` around lines 1350 - 1392, The
variable averages must be initialized unconditionally to avoid zero-default
fallbacks in calculate_pseudocost_score; call averages = compute_averages()
before computing reliable_threshold (or at least before any calls to
calculate_pseudocost_score) so averages.down_avg/up_avg and num_init_* are valid
when calculate_pseudocost_score is invoked (refer to pseudo_cost_averages_t and
compute_averages, and the calculate_pseudocost_score calls at lines around 1403,
1519, 1693, 1773). Keep the existing log.printf inside the reliable_threshold ==
0 branch (or guard it) but ensure averages is assigned regardless of the
threshold.
♻️ Duplicate comments (2)
cpp/src/branch_and_bound/branch_and_bound.cpp (1)

1718-1719: ⚠️ Potential issue | 🔴 Critical

Don’t recycle the BFS worker while diving tasks still hold its back-pointer.

launch_diving_worker() stores diving_worker->bfs_worker = bfs_worker and increments that BFS worker’s diving counters, but this function returns the BFS worker to the idle pool immediately after the best-first loop exits. If any dive is still running, it can report back into a recycled worker object and corrupt the new assignment’s accounting. Please keep the BFS worker out of the idle pool until total_active_diving_workers reaches zero, or make the dive completion path generation-aware.

As per coding guidelines, "Flag missing synchronization for shared state in concurrent code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/branch_and_bound/branch_and_bound.cpp` around lines 1718 - 1719, The
BFS worker is returned to the idle pool while active diving tasks still hold
pointers to it (launch_diving_worker sets diving_worker->bfs_worker and
increments counters), risking use-after-recycle; modify the lifecycle so the BFS
worker is not returned via bfs_worker_pool_.return_worker_to_pool(worker) (and
not marked inactive via worker->set_inactive()) until
total_active_diving_workers reaches zero, or alternatively change the diving
completion path to validate/generate-awarely clear diving_worker->bfs_worker and
decrement the BFS worker’s diving counters before any worker object can be
returned to the pool; update launch_diving_worker, the dive completion callback,
and the code around
worker->set_inactive()/bfs_worker_pool_.return_worker_to_pool to enforce this
ordering and synchronization.
cpp/src/branch_and_bound/worker.hpp (1)

184-186: ⚠️ Potential issue | 🟠 Major

Refresh lower_bound when a node is stolen.

After this->node_queue.push(node), this worker still advertises its old lower_bound. If global gap checks aggregate per-worker bounds, a cheaper stolen node stays invisible until this worker starts processing it.

Suggested fix
       this->node_queue.lock();
       this->node_queue.push(node);
+      this->lower_bound = std::min(this->lower_bound.load(), node->lower_bound);
       this->node_queue.unlock();

As per coding guidelines, "Flag algorithm correctness errors including logic errors in branch-and-bound implementations."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/branch_and_bound/worker.hpp` around lines 184 - 186, After pushing a
stolen node into this->node_queue (at the this->node_queue.push(node) site) the
worker still advertises an out-of-date this->lower_bound; update it immediately
so global gap checks see the cheaper bound. Fix: after pushing (and/or after
releasing the node_queue lock) recompute and publish the local lower bound by
invoking the same routine used elsewhere to derive the worker bound (e.g., call
the existing update_lower_bound()/recompute_lower_bound_from_queue() helper or
inline the same computation that sets this->lower_bound), ensuring any required
synchronization or atomics are used when writing this->lower_bound.
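
On the thread-safety caveat at the end of that prompt: if `lower_bound` is a `std::atomic<double>`, there is no `fetch_min`, so publishing a smaller bound outside the queue lock needs a compare-exchange loop. A minimal illustrative sketch (not the PR's code):

```cpp
#include <atomic>
#include <cassert>

// Lower an atomic bound to `candidate` if it is smaller, using a CAS loop.
inline void atomic_min(std::atomic<double>& bound, double candidate) {
  double current = bound.load();
  while (candidate < current &&
         !bound.compare_exchange_weak(current, candidate)) {
    // compare_exchange_weak reloads `current` on failure; loop until the
    // stored value is already <= candidate or our store succeeds.
  }
}
```

If the update always happens while holding the same `node_queue` lock used for the push, the plain `std::min` assignment from the suggested diff is sufficient and the CAS loop is unnecessary.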
🧹 Nitpick comments (1)
cpp/src/branch_and_bound/pseudo_costs.cpp (1)

750-759: Document the warm-start PDLP tolerance choice.

pdlp_tolerance = 1e-5 drives all six stopping tolerances here, but the code does not explain why this is the right accuracy/cost tradeoff for strong branching warm starts. A brief rationale or a named setting would make later tuning much safer.

As per coding guidelines, "Flag missing documentation for numerical tolerances and algorithm parameters."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/branch_and_bound/pseudo_costs.cpp` around lines 750 - 759, Add a
short comment and/or named constant explaining why pdlp_tolerance = 1e-5 is
chosen for warm-start PDLP and that it trades accuracy for speed for
strong-branching warm starts: document that this tolerance is relaxed/tightened
to balance warm-start cost vs branching quality, mention any empirical basis
(e.g., "empirically found to preserve branching decisions while reducing solve
time") and the related setting name pdlp_tolerance which is applied to
ws_settings.tolerances (relative/absolute dual/primal/gap); alternatively pull
the literal into a clearly named constant (e.g., warm_start_pdlp_tolerance) with
the comment so future tuning is safer and traceable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/branch_and_bound/branch_and_bound.cpp`:
- Around line 3838-3850: Before calling dual_phase2_with_advanced_basis (and
similarly solve_linear_program_with_advanced_basis) for a deterministic dive,
restore the node’s packed basis into the worker’s vstatus: copy
node_ptr->packed_vstatus into worker.leaf_vstatus (or call the helper that
rebuilds vstatus from packed_vstatus) so the advanced basis matches the current
node; do this prior to using worker.leaf_vstatus in the LP call (references:
worker.leaf_vstatus, node_ptr->packed_vstatus, dual_phase2_with_advanced_basis,
solve_linear_program_with_advanced_basis).

In `@cpp/src/branch_and_bound/pseudo_costs.cpp`:
- Around line 784-789: The async device_uvector copies into
cache->initial_primal/initial_dual are performed on the warm-start stream
(ws_primal.stream()) but cache->populated is set immediately, causing races when
read later by batch_pdlp_reliability_branching_task using a different
batch_pdlp_handle stream; fix by synchronizing the source and consumer streams:
after creating the device_uvector copies (the code that sets
cache->initial_primal and cache->initial_dual from ws_primal/ws_dual and before
setting cache->populated) record an event on the warm-start stream (or call
stream.synchronize()) and make the batch_pdlp_handle stream wait on that event
(or synchronize) so readers in batch_pdlp_reliability_branching_task only see
populated=true after the copies are complete; apply the same pattern for both
warm-start places that set cache->populated and for the corresponding reader
path in batch_pdlp_reliability_branching_task.

In `@cpp/src/dual_simplex/initial_basis.cpp`:
- Around line 21-45: The current 2-bit packing in encode()/decode() is lossy (5
states need representation); fix by widening the packed representation to 3
bits: in encode(variable_status_t) assign unique 3-bit codes for BASIC,
NONBASIC_LOWER, NONBASIC_UPPER, NONBASIC_FIXED and NONBASIC_FREE (e.g. 0..4),
and in decode(uint8_t) mask with 0b111 and map those codes back to the exact
enum values; update any assertions to still reject SUPERBASIC and ensure both
functions use the same 3-bit scheme so packing/unpacking is lossless.
- Around line 76-85: The assert in decompress_vstatus is checking the wrong
value: it compares vstatus_size (number of unpacked statuses) against n +
has_tail (expected number of packed bytes). Update the assertion to validate the
input packed size by asserting that packed_vstatus.size() == n + has_tail so the
function verifies the number of packed bytes matches the computed count; keep
the rest of decompress_vstatus (variables n, has_tail, vstatus) unchanged.
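
The lossless 3-bit scheme suggested for encode()/decode() can be illustrated as follows. This is a self-contained sketch: the enum values and the two-codes-per-byte packing layout are assumptions for the example, not cuOpt's actual definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Five basis states need 3 bits; a 2-bit code cannot represent them losslessly.
// Enum names mirror the ones mentioned in the review; values are illustrative.
enum class variable_status_t : uint8_t {
  BASIC          = 0,
  NONBASIC_LOWER = 1,
  NONBASIC_UPPER = 2,
  NONBASIC_FIXED = 3,
  NONBASIC_FREE  = 4,
};

inline uint8_t encode(variable_status_t s) { return static_cast<uint8_t>(s) & 0b111; }

inline variable_status_t decode(uint8_t bits) {
  return static_cast<variable_status_t>(bits & 0b111);
}

// Pack two 3-bit codes per byte for simplicity (a tighter layout could fit
// eight codes into three bytes).
inline std::vector<uint8_t> compress(const std::vector<variable_status_t>& vstatus) {
  std::vector<uint8_t> packed((vstatus.size() + 1) / 2);  // zero-initialized
  for (std::size_t i = 0; i < vstatus.size(); ++i) {
    uint8_t code = encode(vstatus[i]);
    packed[i / 2] |= (i % 2 == 0) ? code : static_cast<uint8_t>(code << 4);
  }
  return packed;
}

inline std::vector<variable_status_t> decompress(const std::vector<uint8_t>& packed,
                                                 std::size_t n) {
  std::vector<variable_status_t> vstatus(n);
  for (std::size_t i = 0; i < n; ++i) {
    uint8_t byte = packed[i / 2];
    vstatus[i] = decode((i % 2 == 0) ? byte : static_cast<uint8_t>(byte >> 4));
  }
  return vstatus;
}
```

Because every state gets a unique 3-bit code, a compress/decompress roundtrip reproduces the input exactly, which is the property the 2-bit version lost.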



📥 Commits

Reviewing files that changed from the base of the PR and between a67f644 and 7df7aa1.

📒 Files selected for processing (10)
  • cpp/src/branch_and_bound/branch_and_bound.cpp
  • cpp/src/branch_and_bound/branch_and_bound.hpp
  • cpp/src/branch_and_bound/mip_node.hpp
  • cpp/src/branch_and_bound/node_queue.hpp
  • cpp/src/branch_and_bound/pseudo_costs.cpp
  • cpp/src/branch_and_bound/pseudo_costs.hpp
  • cpp/src/branch_and_bound/worker.hpp
  • cpp/src/dual_simplex/initial_basis.cpp
  • cpp/src/dual_simplex/initial_basis.hpp
  • cpp/src/utilities/omp_helpers.hpp
✅ Files skipped from review due to trivial changes (1)
  • cpp/src/dual_simplex/initial_basis.hpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/src/utilities/omp_helpers.hpp

Comment thread cpp/src/branch_and_bound/branch_and_bound.cpp
Comment on lines +784 to +789
cache->initial_primal = rmm::device_uvector<f_t>(ws_primal, ws_primal.stream());
cache->initial_dual = rmm::device_uvector<f_t>(ws_dual, ws_dual.stream());
cache->step_size = ws_solution.get_pdlp_warm_start_data().initial_step_size_;
cache->primal_weight = ws_solution.get_pdlp_warm_start_data().initial_primal_weight_;
cache->pdlp_iteration = ws_solution.get_pdlp_warm_start_data().total_pdlp_iterations_;
cache->populated = true;

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain (collapsed): a series of shell scripts run by the bot to inspect cpp/src/branch_and_bound/pseudo_costs.cpp, pseudo_costs.hpp, and cpp/include/cuopt/linear_programming/pdlp/solver_solution.hpp, checking for stream synchronization (cudaStreamSynchronize, events) around the warm-start cache writes and reads.


Synchronize cache visibility before reads from different stream contexts.

Cache vectors are copied asynchronously on the warm-start handle's stream (lines 784-789), but the populated flag is set immediately without synchronization. Later, batch_pdlp_reliability_branching_task reads the cache through a different batch_pdlp_handle (lines 943-952) with no ordering guarantee. Since the handle's streams are independent, the async copy may not complete before the read, causing races. This applies to both code locations (lines 784-789 and 943-952).


Comment thread cpp/src/dual_simplex/initial_basis.cpp
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
@nguidotti (Contributor, Author) commented:

/ok to test 54d3f66

@nguidotti nguidotti changed the base branch from main to pull-request/1120 April 28, 2026 15:00
@nguidotti nguidotti changed the base branch from pull-request/1120 to main April 28, 2026 15:04
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
# Conflicts:
#	cpp/src/branch_and_bound/branch_and_bound.cpp
#	cpp/src/branch_and_bound/constants.hpp
#	cpp/src/branch_and_bound/pseudo_costs.hpp
#	cpp/src/utilities/omp_helpers.hpp
@nguidotti (Contributor, Author) commented:

/ok to test 2adec89

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
@nguidotti (Contributor, Author) commented:

/ok to test bc3d50b


Labels

  • improvement: Improves an existing functionality
  • mip
  • non-breaking: Introduces a non-breaking change
