Skip to content

UCC/CORE: Added local rank from topo if not provided by user#1245

Open
MaayanGadishNvidia wants to merge 1 commit intoopenucx:masterfrom
MaayanGadishNvidia:NIC_bid_auto
Open

UCC/CORE: Added local rank from topo if not provided by user#1245
MaayanGadishNvidia wants to merge 1 commit intoopenucx:masterfrom
MaayanGadishNvidia:NIC_bid_auto

Conversation

@MaayanGadishNvidia
Copy link
Copy Markdown

What

Adding automation for calculate local rank from topo if not provided by user

Why ?

Continue of #1189

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jan 22, 2026

Greptile Summary

This PR adds automatic local rank detection when NODE_LOCAL_ID is set to auto and OOB is available. It introduces a new ucc_core_ctx_id_exchange() function that performs a lightweight prefix allgather (exchanging only ctx_id + host_info, i.e., offsetof(ucc_context_addr_header_t, n_components) bytes per rank) before the main address exchange. The result feeds ucc_context_topo_init() and ucc_topo_init() to compute the per-rank node-local index, which is then passed to TL contexts and ultimately to UCX.

Key improvements over previous iterations:

  • ctx->id is now initialised once (lines 772–773) before both exchanges, removing the earlier double-initialisation bug.
  • The topo_required && !ctx->topo guard at line 962 prevents double topo allocation.
  • Error paths at error_ctx now call ucc_context_topo_cleanup(ctx->topo) and free addr_storage.storage, plugging previously reported memory leaks.
  • The prefix exchange allocates (n+1) entries—slot n is the local scratch buffer—and sets addr_len = ctx_addr_prefix_len, making the storage layout compatible with ucc_context_topo_init() which only reads h->ctx_id.pi and h->host_info, both within that prefix.

Remaining concern: ucc_topo_node_local_rank() returns UCC_RANK_INVALID (UINT32_MAX) when UCC_SBGP_NODE is not enabled (e.g., all ranks on separate nodes in a multi-rank OOB scenario). This value is assigned to b_params.node_local_id without a validity check. The TL UCP guard is params->node_local_id != UCC_ULUNITS_AUTO; since UCC_RANK_INVALID ≠ UCC_ULUNITS_AUTO, UINT32_MAX would be forwarded to UCX as the node-local ID. Callers should validate the return value and fall back to 0 for single-rank-per-node scenarios.

Confidence Score: 4/5

Safe to merge for multi-rank, multi-node topologies with ppn>1; single-rank-per-node in multi-rank OOB may pass UINT32_MAX to UCX as node_local_id

Most previously raised P1 concerns (double ctx->id init, topo double-init, error-path leaks) appear resolved in this iteration. One P1 remains: UCC_RANK_INVALID is propagated unchecked to UCX when UCC_SBGP_NODE is not enabled (all ranks on separate nodes with n_oob_eps > 1), since UCC_RANK_INVALID (UINT32_MAX) != UCC_ULUNITS_AUTO and the TL UCP guard does not filter it out.

src/components/topo/ucc_topo.h (ucc_topo_node_local_rank return value for disabled sbgp) and src/core/ucc_context.c line 845 (missing validation of UCC_RANK_INVALID before assigning to b_params.node_local_id)

Important Files Changed

Filename Overview
src/components/topo/ucc_topo.h Adds ucc_topo_node_local_rank() helper; returns UCC_RANK_INVALID when node sbgp is not enabled, which propagates unchecked to UCX in single-rank-per-node multi-rank OOB scenarios
src/core/ucc_context.c Adds ucc_core_ctx_id_exchange() for a prefix-only allgather (ctx_id + host_info); early exchange cleanup, double topo-init guard, and error-path topo/storage cleanup are all handled correctly
src/core/ucc_context.h Declares ucc_core_ctx_id_exchange; updates NODE_LOCAL_ID doc string to document the new three-allgather contract required of OOB backends

Greploops — Automatically fix all review issues by running /greploops in Claude Code. It iterates: fix, push, re-review, repeat until 5/5 confidence.
Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal.

Reviews (54): Last reviewed commit: "UCC/CORE: Added automation to loading lo..." | Re-trigger Greptile

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment thread src/core/ucc_context.c Outdated
Comment thread src/components/topo/ucc_topo.h Outdated
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment thread src/core/ucc_context.c Outdated
Comment thread src/components/topo/ucc_topo.h Outdated
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment thread src/core/ucc_context.c Outdated
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment thread src/core/ucc_context.c Outdated
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

Comment thread src/core/ucc_context.c Outdated
Comment thread src/core/ucc_context.c Outdated
Comment thread src/components/tl/ucp/tl_ucp_context.c Outdated
Comment thread src/core/ucc_context.c Outdated
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment thread src/core/ucc_context.c Outdated
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment thread src/core/ucc_context.c Outdated
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment thread src/core/ucc_context.c Outdated
Comment thread src/core/ucc_context.c Outdated
Comment thread src/core/ucc_context.c Outdated
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

Comment thread src/core/ucc_context.c
Comment thread src/core/ucc_context.c Outdated
Comment thread src/core/ucc_context.c Outdated
ucp_params.estimated_num_eps = params->estimated_num_eps;
}

#ifdef HAVE_UCX_NODE_LOCAL_ID
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace added.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment thread src/core/ucc_context.c Outdated
Comment thread src/core/ucc_context.c Outdated
Comment thread src/core/ucc_context.c
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment thread src/core/ucc_context.c Outdated
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment thread src/core/ucc_context.c Outdated
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

Comment thread src/core/ucc_context.c
Comment thread src/core/ucc_context.c
Comment thread src/core/ucc_context.c Outdated
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jan 29, 2026

Additional Comments (1)

src/core/ucc_context.c
Added components/topo/ucc_topo.h include to support automatic local rank computation from topology

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment thread src/core/ucc_context.c Outdated
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment thread src/core/ucc_context.c Outdated
@janjust
Copy link
Copy Markdown
Collaborator

janjust commented Feb 12, 2026

/build

Comment thread src/core/ucc_context.c
Comment thread src/core/ucc_context.c
Comment thread src/core/ucc_context.c
Comment thread src/core/ucc_context.c
Comment thread src/core/ucc_context.c
@janjust
Copy link
Copy Markdown
Collaborator

janjust commented Mar 20, 2026

/build

@MaayanGadishNvidia MaayanGadishNvidia force-pushed the NIC_bid_auto branch 12 times, most recently from 5b38353 to bec53ca Compare March 30, 2026 15:30
@janjust
Copy link
Copy Markdown
Collaborator

janjust commented Mar 30, 2026

/build

Comment thread src/components/topo/ucc_topo.h Outdated
ucc_sbgp_t *sbgp;

if (!topo) {
return 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it makes sense to return rank_invalid in this case

Comment thread src/components/topo/ucc_topo.h Outdated
sbgp = ucc_topo_get_sbgp(topo, UCC_SBGP_NODE);

if (sbgp->status != UCC_SBGP_ENABLED) {
return 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

invalid rank

Comment thread tools/perf/ucc_pt_comm.cc
cfg_mod = std::to_string(bootstrap->get_ppn());
UCCCHECK_GOTO(ucc_context_config_modify(ctx_config, NULL,
"ESTIMATED_NUM_PPN", cfg_mod.c_str()), free_ctx_config, st);
cfg_mod = std::to_string(bootstrap->get_local_rank());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it for validation of my function. Ill return it

Comment thread src/core/ucc_context.c Outdated
return status;
}


Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rem extra blank line

Comment thread src/core/ucc_context.c
do {
/* UCC context create is blocking fn, so we can wait here for the
completion of addr exchange */
status = ucc_core_ctx_id_exchange(ctx, &ctx->params.oob,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

second oob allgather does host info exchange as well, which is not always necessary since we already have topo info

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I currently clean topo after the local id clac. Do you think I should create another addr exchange function which complete the rest of the original addr exchange, in case that the ucc_core_ctx_id_exchange was used?

@MaayanGadishNvidia MaayanGadishNvidia force-pushed the NIC_bid_auto branch 3 times, most recently from 4a52d46 to e05830c Compare April 1, 2026 12:25
@janjust
Copy link
Copy Markdown
Collaborator

janjust commented Apr 6, 2026

/build

@janjust
Copy link
Copy Markdown
Collaborator

janjust commented Apr 9, 2026

@MaayanGadishNvidia let's rework so that init tim doesn't increase, please run osu_mpi_init tests to make sure we don't impact init

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants