
B&B with Local Heaps#1149

Open
nguidotti wants to merge 71 commits into NVIDIA:main from nguidotti:bnb-local-heap

Conversation

@nguidotti (Contributor) commented Apr 27, 2026

In this PR, each best-first worker has its own local node heap, so it can push and pop nodes without synchronizing with the other workers. Each best-first worker periodically steals a node from a random worker to keep the node distribution roughly balanced across workers. Additionally, each best-first worker has a fixed set of diving workers assigned to it, which perform diving on its nodes whenever possible. This essentially eliminates the need for the scheduler thread, freeing one additional thread to do useful work.

Includes #1020.
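
The per-worker design described above can be sketched roughly as follows. This is a minimal single-file illustration, not the actual cuOpt code: `node_t`, `worker_t`, and `steal_one` are invented names for the example, and a real implementation would order branch-and-bound nodes by lower bound and trigger stealing on a schedule.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>
#include <random>
#include <vector>

// Hypothetical node: a smaller lower_bound is more promising (min-heap order).
struct node_t {
  double lower_bound;
};

struct node_cmp {
  bool operator()(const node_t& a, const node_t& b) const {
    return a.lower_bound > b.lower_bound;  // min-heap on lower_bound
  }
};

// Each worker owns a heap guarded by its own mutex: local push/pop never
// contends with other workers unless a thief happens to grab the lock.
struct worker_t {
  std::mutex mtx;
  std::priority_queue<node_t, std::vector<node_t>, node_cmp> heap;

  void push(node_t n) {
    std::lock_guard<std::mutex> lk(mtx);
    heap.push(n);
  }

  bool try_pop(node_t& out) {
    std::lock_guard<std::mutex> lk(mtx);
    if (heap.empty()) return false;
    out = heap.top();
    heap.pop();
    return true;
  }
};

// Called periodically by a worker whose heap ran low: pick a random victim
// and move one node over to keep the distribution roughly balanced. Here the
// thief takes the victim's current best node; the real policy may differ.
inline bool steal_one(std::vector<worker_t>& workers, std::size_t thief_id, std::mt19937& rng) {
  std::uniform_int_distribution<std::size_t> pick(0, workers.size() - 1);
  std::size_t victim_id = pick(rng);
  if (victim_id == thief_id) return false;
  node_t stolen;
  if (!workers[victim_id].try_pop(stolen)) return false;
  workers[thief_id].push(stolen);
  return true;
}
```

Because each heap has its own lock, contention only occurs on the (rare) steal path, which is what makes the dedicated scheduler thread unnecessary.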

MIPLIB results (GH200, 10min):

================================================================================
 main-100426-1 (1) vs bnb-local-heap (2)
================================================================================

------------------------------------------------------------------------------------------------------------------------------
|                                        |       Run 1        |       Run 2        |     Abs. Diff.     |   Rel. Diff. (%)   |
------------------------------------------------------------------------------------------------------------------------------
| Feasible                                                 227                  226                   -1                 --- |
| Optimal                                                   71                   77                   +6                 --- |
| Solutions with <0.1% primal gap                          124                  130                   +6                 --- |
| Nodes explored (mean)                              4.484e+06            1.316e+07           +8.675e+06                +193 |
| Nodes explored (shifted geomean)                        7395            1.202e+04                +4620               +62.5 |
| Relative MIP gap (mean)                               0.3361               0.3492             +0.01306               +3.89 |
| Relative MIP gap (shifted geomean)                     0.116               0.1146            -0.001384               -1.19 |
| Solve time (mean)                                      449.2                436.4                -12.8               -2.85 |
| Solve time (shifted geomean)                           231.9                212.8               -19.03               -8.21 |
| Primal gap (mean)                                      11.26                11.77              +0.5112               +4.54 |
| Primal gap (shifted geomean)                           0.602               0.5709             -0.03103               -5.15 |
| Primal integral (mean)                                 33.85                34.99               +1.142               +3.37 |
| Primal integral (shifted geomean)                      6.655                 6.89              +0.2345               +3.52 |
------------------------------------------------------------------------------------------------------------------------------

In summary, we explored ~3x more nodes on average in the same time frame. The number of optimal solutions also increased by 6.

Checklist

  • I am familiar with the Contributing Guidelines.
  • Testing
    • New or existing tests cover these changes
    • Added tests
    • Created an issue to follow-up
    • NA
  • Documentation
    • The documentation is up to date with these changes
    • Added new documentation
    • NA

bdice and others added 30 commits April 3, 2026 13:51
Remove dependency on rmm::mr::device_memory_resource base class. Resources
now satisfy the cuda::mr::resource concept directly.

- Replace shared_ptr<device_memory_resource> with value types and
  cuda::mr::any_resource<cuda::mr::device_accessible> for type-erased storage
- Replace set_current_device_resource(ptr) with set_current_device_resource_ref
- Replace set_per_device_resource(id, ptr) with set_per_device_resource_ref
- Remove make_owning_wrapper usage
- Remove dynamic_cast on memory resources (no common base class)
- Remove owning_wrapper.hpp and device_memory_resource.hpp includes
- Add missing thrust/iterator/transform_output_iterator.h include
  (no longer transitively included via CCCL)
…nd deterministic mode.

Signed-off-by: Nicolas Guidotti <224634272+nguidotti@users.noreply.github.com>
Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>
… shared_ptr to avoid unnecessary copy.

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…l crash in work-stealing

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…queue for now. refactoring.

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
… are present

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
# Conflicts:
#	cpp/src/utilities/cuda_helpers.cuh
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
# Conflicts:
#	ci/validate_wheel.sh
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
@nguidotti nguidotti requested review from chris-maes and hlinsen April 28, 2026 08:17
@coderabbitai (bot) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
cpp/src/branch_and_bound/branch_and_bound.cpp (1)

1687-1692: Decrement nodes_unexplored when a queued node is cut off.

This path prunes a node that was previously counted as unexplored, but the counter is never decremented here. That leaves the solver stats inconsistent and can skew the final summary.

Suggested fix
     if (upper_bound_.load() < start_node->lower_bound) {
       // This node was put on the heap earlier but its lower bound is now greater than the
       // current upper bound
       search_tree_.graphviz_node(settings_.log, start_node, "cutoff", start_node->lower_bound);
       search_tree_.update(start_node, node_status_t::FATHOMED);
+      --exploration_stats_.nodes_unexplored;
       continue;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/branch_and_bound/branch_and_bound.cpp` around lines 1687 - 1692, The
cutoff branch that checks if upper_bound_.load() < start_node->lower_bound
currently marks the node as FATHOMED but doesn't decrement the unexplored
counter; update that if block to decrement nodes_unexplored when a queued node
is pruned (i.e., right after search_tree_.update(start_node,
node_status_t::FATHOMED)), using the appropriate operation for the counter's
type (nodes_unexplored-- or nodes_unexplored.fetch_sub(1) if it's atomic) so
solver stats remain consistent.


📥 Commits

Reviewing files that changed from the base of the PR and between d936e05 and a67f644.

📒 Files selected for processing (5)
  • cpp/src/branch_and_bound/branch_and_bound.cpp
  • cpp/src/branch_and_bound/branch_and_bound.hpp
  • cpp/src/branch_and_bound/mip_node.hpp
  • cpp/src/branch_and_bound/worker.hpp
  • cpp/src/branch_and_bound/worker_pool.hpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/src/branch_and_bound/mip_node.hpp

Comment on lines +1717 to +1718
worker->set_inactive();
bfs_worker_pool_.return_worker_to_pool(worker);

⚠️ Potential issue | 🔴 Critical

Don’t recycle a BFS worker while its diving workers still reference it.

best_first_search_with() returns the worker to bfs_worker_pool_ immediately, but cpp/src/branch_and_bound/worker.hpp:271-281 shows each active diving worker still decrements counters on that same bfs_worker in set_inactive(). If this object gets reused first, old dives will mutate the new assignment’s accounting and can underflow/assert on completion.

Hold the BFS worker out of the idle pool until total_active_diving_workers reaches zero, or make the diving-worker accounting generation-aware before reuse.

As per coding guidelines, flag missing synchronization for shared state in concurrent code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/branch_and_bound/branch_and_bound.cpp` around lines 1717 - 1718,
best_first_search_with() returns the bfs worker to bfs_worker_pool_ immediately
while active diving workers still call worker->set_inactive() and decrement
total_active_diving_workers, causing reuse races; either defer calling
bfs_worker_pool_.return_worker_to_pool(worker) until that worker's
total_active_diving_workers reaches zero (check the worker's
total_active_diving_workers/active_diving_worker_count before returning) or
implement a generation/tag on the BFS worker that diving workers validate before
mutating counters (increment generation on checkout and have set_inactive()
ignore updates if generation mismatches), and add the missing synchronization
(mutex/atomic) around shared counters (total_active_diving_workers) to prevent
races.
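
The generation-tag alternative suggested in this comment could look roughly like the sketch below. All names are invented (this is not the PR's actual `bfs_worker_t`), and a real fix would need a stronger protocol around the check-then-decrement pair; the sketch only shows the core idea that a stale dive must not mutate a recycled worker's accounting.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical sketch: a diving worker snapshots the BFS worker's generation
// when it is launched and only reports back if the generation still matches,
// so a recycled worker object ignores stale completions.
struct bfs_worker_t {
  std::atomic<uint64_t> generation{0};
  std::atomic<int> active_diving_workers{0};

  // Called when the worker is checked out of the pool for a new assignment;
  // bumping the generation invalidates any dives from the previous owner.
  uint64_t checkout() {
    active_diving_workers.store(0);
    return generation.fetch_add(1) + 1;
  }

  void on_dive_started() { active_diving_workers.fetch_add(1); }

  // A dive finishing with a stale tag must not touch the new assignment.
  // Note: the load/decrement pair is not atomic as a whole; a production
  // version would serialize this with the recycle path.
  bool on_dive_finished(uint64_t tag) {
    if (tag != generation.load()) return false;  // stale: ignore
    active_diving_workers.fetch_sub(1);
    return true;
  }
};
```

The alternative the comment mentions, holding the worker out of the idle pool until `total_active_diving_workers` reaches zero, avoids the tag entirely at the cost of delaying reuse.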

Comment thread cpp/src/branch_and_bound/worker.hpp
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…y space

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
@nguidotti (Contributor, Author) commented:

/ok to test 2cd842a

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
@nguidotti (Contributor, Author) commented:

/ok to test 56c8cff

@nguidotti nguidotti requested review from aliceb-nv and removed request for hlinsen April 28, 2026 12:56
@coderabbitai (bot) left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/src/branch_and_bound/pseudo_costs.cpp (1)

1350-1392: ⚠️ Potential issue | 🟠 Major

Initialize averages unconditionally to ensure correct fallback scoring.

averages is only assigned when reliable_threshold == 0, but calculate_pseudocost_score(..., averages) is called at lines 1403, 1519, 1693, and 1773 regardless of the threshold value. When reliable_threshold > 0, uninitialized variables with num_up == 0 or num_down == 0 fall back to averages.up_avg and averages.down_avg, which remain at their zero-initialized defaults (0.0) instead of the computed averages. This produces incorrect branch scores.

Suggested fix
-  pseudo_cost_averages_t<i_t, f_t> averages;
+  pseudo_cost_averages_t<i_t, f_t> averages = compute_averages();
...
-  if (reliable_threshold == 0) {
-    averages = compute_averages();
+  if (reliable_threshold == 0) {
     log.printf("PC: num initialized down %d up %d avg down %e up %e\n",
                averages.num_init_down,
                averages.num_init_up,
                averages.down_avg,
                averages.up_avg);

Also applies to: 1693, 1773

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/branch_and_bound/pseudo_costs.cpp` around lines 1350 - 1392, The
variable averages must be initialized unconditionally to avoid zero-default
fallbacks in calculate_pseudocost_score; call averages = compute_averages()
before computing reliable_threshold (or at least before any calls to
calculate_pseudocost_score) so averages.down_avg/up_avg and num_init_* are valid
when calculate_pseudocost_score is invoked (refer to pseudo_cost_averages_t and
compute_averages, and the calculate_pseudocost_score calls at lines around 1403,
1519, 1693, 1773). Keep the existing log.printf inside the reliable_threshold ==
0 branch (or guard it) but ensure averages is assigned regardless of the
threshold.
♻️ Duplicate comments (2)
cpp/src/branch_and_bound/branch_and_bound.cpp (1)

1718-1719: ⚠️ Potential issue | 🔴 Critical

Don’t recycle the BFS worker while diving tasks still hold its back-pointer.

launch_diving_worker() stores diving_worker->bfs_worker = bfs_worker and increments that BFS worker’s diving counters, but this function returns the BFS worker to the idle pool immediately after the best-first loop exits. If any dive is still running, it can report back into a recycled worker object and corrupt the new assignment’s accounting. Please keep the BFS worker out of the idle pool until total_active_diving_workers reaches zero, or make the dive completion path generation-aware.

As per coding guidelines, "Flag missing synchronization for shared state in concurrent code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/branch_and_bound/branch_and_bound.cpp` around lines 1718 - 1719, The
BFS worker is returned to the idle pool while active diving tasks still hold
pointers to it (launch_diving_worker sets diving_worker->bfs_worker and
increments counters), risking use-after-recycle; modify the lifecycle so the BFS
worker is not returned via bfs_worker_pool_.return_worker_to_pool(worker) (and
not marked inactive via worker->set_inactive()) until
total_active_diving_workers reaches zero, or alternatively change the diving
completion path to validate/generate-awarely clear diving_worker->bfs_worker and
decrement the BFS worker’s diving counters before any worker object can be
returned to the pool; update launch_diving_worker, the dive completion callback,
and the code around
worker->set_inactive()/bfs_worker_pool_.return_worker_to_pool to enforce this
ordering and synchronization.
cpp/src/branch_and_bound/worker.hpp (1)

184-186: ⚠️ Potential issue | 🟠 Major

Refresh lower_bound when a node is stolen.

After this->node_queue.push(node), this worker still advertises its old lower_bound. If global gap checks aggregate per-worker bounds, a cheaper stolen node stays invisible until this worker starts processing it.

Suggested fix
       this->node_queue.lock();
       this->node_queue.push(node);
+      this->lower_bound = std::min(this->lower_bound.load(), node->lower_bound);
       this->node_queue.unlock();

As per coding guidelines, "Flag algorithm correctness errors including logic errors in branch-and-bound implementations."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/branch_and_bound/worker.hpp` around lines 184 - 186, After pushing a
stolen node into this->node_queue (at the this->node_queue.push(node) site) the
worker still advertises an out-of-date this->lower_bound; update it immediately
so global gap checks see the cheaper bound. Fix: after pushing (and/or after
releasing the node_queue lock) recompute and publish the local lower bound by
invoking the same routine used elsewhere to derive the worker bound (e.g., call
the existing update_lower_bound()/recompute_lower_bound_from_queue() helper or
inline the same computation that sets this->lower_bound), ensuring any required
synchronization or atomics are used when writing this->lower_bound.
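
On the thread-safety caveat at the end of that prompt: if `lower_bound` is a `std::atomic<double>`, there is no `fetch_min`, so publishing a smaller bound outside the queue lock needs a compare-exchange loop. A minimal illustrative sketch (not the PR's code):

```cpp
#include <atomic>
#include <cassert>

// Lower an atomic bound to `candidate` if it is smaller, using a CAS loop.
inline void atomic_min(std::atomic<double>& bound, double candidate) {
  double current = bound.load();
  while (candidate < current &&
         !bound.compare_exchange_weak(current, candidate)) {
    // compare_exchange_weak reloads `current` on failure; loop until the
    // stored value is already <= candidate or our store succeeds.
  }
}
```

If the update always happens while holding the same `node_queue` lock used for the push, the plain `std::min` assignment from the suggested diff is sufficient and the CAS loop is unnecessary.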
🧹 Nitpick comments (1)
cpp/src/branch_and_bound/pseudo_costs.cpp (1)

750-759: Document the warm-start PDLP tolerance choice.

pdlp_tolerance = 1e-5 drives all six stopping tolerances here, but the code does not explain why this is the right accuracy/cost tradeoff for strong branching warm starts. A brief rationale or a named setting would make later tuning much safer.

As per coding guidelines, "Flag missing documentation for numerical tolerances and algorithm parameters."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/branch_and_bound/pseudo_costs.cpp` around lines 750 - 759, Add a
short comment and/or named constant explaining why pdlp_tolerance = 1e-5 is
chosen for warm-start PDLP and that it trades accuracy for speed for
strong-branching warm starts: document that this tolerance is relaxed/tightened
to balance warm-start cost vs branching quality, mention any empirical basis
(e.g., "empirically found to preserve branching decisions while reducing solve
time") and the related setting name pdlp_tolerance which is applied to
ws_settings.tolerances (relative/absolute dual/primal/gap); alternatively pull
the literal into a clearly named constant (e.g., warm_start_pdlp_tolerance) with
the comment so future tuning is safer and traceable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/branch_and_bound/branch_and_bound.cpp`:
- Around line 3838-3850: Before calling dual_phase2_with_advanced_basis (and
similarly solve_linear_program_with_advanced_basis) for a deterministic dive,
restore the node’s packed basis into the worker’s vstatus: copy
node_ptr->packed_vstatus into worker.leaf_vstatus (or call the helper that
rebuilds vstatus from packed_vstatus) so the advanced basis matches the current
node; do this prior to using worker.leaf_vstatus in the LP call (references:
worker.leaf_vstatus, node_ptr->packed_vstatus, dual_phase2_with_advanced_basis,
solve_linear_program_with_advanced_basis).

In `@cpp/src/branch_and_bound/pseudo_costs.cpp`:
- Around line 784-789: The async device_uvector copies into
cache->initial_primal/initial_dual are performed on the warm-start stream
(ws_primal.stream()) but cache->populated is set immediately, causing races when
read later by batch_pdlp_reliability_branching_task using a different
batch_pdlp_handle stream; fix by synchronizing the source and consumer streams:
after creating the device_uvector copies (the code that sets
cache->initial_primal and cache->initial_dual from ws_primal/ws_dual and before
setting cache->populated) record an event on the warm-start stream (or call
stream.synchronize()) and make the batch_pdlp_handle stream wait on that event
(or synchronize) so readers in batch_pdlp_reliability_branching_task only see
populated=true after the copies are complete; apply the same pattern for both
warm-start places that set cache->populated and for the corresponding reader
path in batch_pdlp_reliability_branching_task.

In `@cpp/src/dual_simplex/initial_basis.cpp`:
- Around line 21-45: The current 2-bit packing in encode()/decode() is lossy (5
states need representation); fix by widening the packed representation to 3
bits: in encode(variable_status_t) assign unique 3-bit codes for BASIC,
NONBASIC_LOWER, NONBASIC_UPPER, NONBASIC_FIXED and NONBASIC_FREE (e.g. 0..4),
and in decode(uint8_t) mask with 0b111 and map those codes back to the exact
enum values; update any assertions to still reject SUPERBASIC and ensure both
functions use the same 3-bit scheme so packing/unpacking is lossless.
- Around line 76-85: The assert in decompress_vstatus is checking the wrong
value: it compares vstatus_size (number of unpacked statuses) against n +
has_tail (expected number of packed bytes). Update the assertion to validate the
input packed size by asserting that packed_vstatus.size() == n + has_tail so the
function verifies the number of packed bytes matches the computed count; keep
the rest of decompress_vstatus (variables n, has_tail, vstatus) unchanged.
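
The lossless 3-bit scheme suggested for encode()/decode() can be illustrated as follows. This is a self-contained sketch: the enum values and the two-codes-per-byte packing layout are assumptions for the example, not cuOpt's actual definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Five basis states need 3 bits; a 2-bit code cannot represent them losslessly.
// Enum names mirror the ones mentioned in the review; values are illustrative.
enum class variable_status_t : uint8_t {
  BASIC          = 0,
  NONBASIC_LOWER = 1,
  NONBASIC_UPPER = 2,
  NONBASIC_FIXED = 3,
  NONBASIC_FREE  = 4,
};

inline uint8_t encode(variable_status_t s) { return static_cast<uint8_t>(s) & 0b111; }

inline variable_status_t decode(uint8_t bits) {
  return static_cast<variable_status_t>(bits & 0b111);
}

// Pack two 3-bit codes per byte for simplicity (a tighter layout could fit
// eight codes into three bytes).
inline std::vector<uint8_t> compress(const std::vector<variable_status_t>& vstatus) {
  std::vector<uint8_t> packed((vstatus.size() + 1) / 2);  // zero-initialized
  for (std::size_t i = 0; i < vstatus.size(); ++i) {
    uint8_t code = encode(vstatus[i]);
    packed[i / 2] |= (i % 2 == 0) ? code : static_cast<uint8_t>(code << 4);
  }
  return packed;
}

inline std::vector<variable_status_t> decompress(const std::vector<uint8_t>& packed,
                                                 std::size_t n) {
  std::vector<variable_status_t> vstatus(n);
  for (std::size_t i = 0; i < n; ++i) {
    uint8_t byte = packed[i / 2];
    vstatus[i] = decode((i % 2 == 0) ? byte : static_cast<uint8_t>(byte >> 4));
  }
  return vstatus;
}
```

Because every state gets a unique 3-bit code, a compress/decompress roundtrip reproduces the input exactly, which is the property the 2-bit version lost.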



📥 Commits

Reviewing files that changed from the base of the PR and between a67f644 and 7df7aa1.

📒 Files selected for processing (10)
  • cpp/src/branch_and_bound/branch_and_bound.cpp
  • cpp/src/branch_and_bound/branch_and_bound.hpp
  • cpp/src/branch_and_bound/mip_node.hpp
  • cpp/src/branch_and_bound/node_queue.hpp
  • cpp/src/branch_and_bound/pseudo_costs.cpp
  • cpp/src/branch_and_bound/pseudo_costs.hpp
  • cpp/src/branch_and_bound/worker.hpp
  • cpp/src/dual_simplex/initial_basis.cpp
  • cpp/src/dual_simplex/initial_basis.hpp
  • cpp/src/utilities/omp_helpers.hpp
✅ Files skipped from review due to trivial changes (1)
  • cpp/src/dual_simplex/initial_basis.hpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/src/utilities/omp_helpers.hpp

Comment thread cpp/src/branch_and_bound/branch_and_bound.cpp
Comment on lines +784 to +789
cache->initial_primal = rmm::device_uvector<f_t>(ws_primal, ws_primal.stream());
cache->initial_dual = rmm::device_uvector<f_t>(ws_dual, ws_dual.stream());
cache->step_size = ws_solution.get_pdlp_warm_start_data().initial_step_size_;
cache->primal_weight = ws_solution.get_pdlp_warm_start_data().initial_primal_weight_;
cache->pdlp_iteration = ws_solution.get_pdlp_warm_start_data().total_pdlp_iterations_;
cache->populated = true;

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain (collapsed): a series of shell scripts run by the bot to inspect cpp/src/branch_and_bound/pseudo_costs.cpp, pseudo_costs.hpp, and cpp/include/cuopt/linear_programming/pdlp/solver_solution.hpp, checking for stream synchronization (cudaStreamSynchronize, events) around the warm-start cache writes and reads.


Synchronize cache visibility before reads from different stream contexts.

Cache vectors are copied asynchronously on the warm-start handle's stream (lines 784-789), but the populated flag is set immediately without synchronization. Later, batch_pdlp_reliability_branching_task reads the cache through a different batch_pdlp_handle (lines 943-952) with no ordering guarantee. Since the handle's streams are independent, the async copy may not complete before the read, causing races. This applies to both code locations (lines 784-789 and 943-952).


Comment thread cpp/src/dual_simplex/initial_basis.cpp
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
@nguidotti (Contributor, Author) commented:

/ok to test 54d3f66

@nguidotti nguidotti changed the base branch from main to pull-request/1120 April 28, 2026 15:00
@nguidotti nguidotti changed the base branch from pull-request/1120 to main April 28, 2026 15:04
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
# Conflicts:
#	cpp/src/branch_and_bound/branch_and_bound.cpp
#	cpp/src/branch_and_bound/constants.hpp
#	cpp/src/branch_and_bound/pseudo_costs.hpp
#	cpp/src/utilities/omp_helpers.hpp
@nguidotti (Contributor, Author) commented:

/ok to test 2adec89

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
@nguidotti (Contributor, Author) commented:

/ok to test bc3d50b


Labels

  • improvement: Improves an existing functionality
  • mip
  • non-breaking: Introduces a non-breaking change
