MGM: fix WeightedRandomPlacement::access selectedIndex and division by zero-weight disk possible by bodoque-01 · Pull Request #57 · cern-eos/eos

bodoque-01 · 2026-05-20T17:42:43Z

Fixes two bugs in WeightedRandomPlacement::access() that prevent it from being safely configured as a per-space access strategy.

Bug 1 `selectedIndex` semantics

access() stored the chosen fsid in args.selectedIndex and then validated it with best_index <= args.selectedfs.size(). The downstream consumer (mgm/ofs/XrdMgmOfsFile.cc:473-523) uses selectedIndex as a
position into arrays (firewalleps[fsIndex], proxys[fsIndex]) and iterated them in parallel.

The mismatch produces two failure modes:

Spurious ENOENT whenever the chosen fsid exceeds the replica list size (as far as I know should be the typical case?), weighted-random access silently fails.
Out-of-bounds read on the match case (small fsid in a small cluster), since the guard also uses <= where the position must be strictly less than the size, the same thing happens

Fix: iterate by index, store the loop position, and use std::numeric_limits<size_t>::max() as the "nothing picked" sentinel.

Bug 2 division by zero on zero-weight disks

score = hashFid(args.inode, fsid) / wt had no guard against wt == 0. Disk weight is std::atomic<uint8_t> default-initialised to 0 (mgm/placement/ClusterDataTypes.hh:66). validDiskPlct does not filter on weight (mgm/placement/PlacementStrategy.hh:318-339). A file with a replica on a zero-weight disk would SIGFPE the MGM.

Fix: skip zero-weight disks before the division. Treating weight 0 as "do not select" is consistent with how populateWeights already handles it for placement (discrete_distribution assigns it zero probability).

Regarding testing, I added a new case that covers both fixes, both that selectedIndex is a position and that a
zero-weight disk is considered skippable. Too, that a list of only zero-weight disks (don't know how extreme that case would be) returns ENOENT. I ran the tests on the prebuild-el9-diopside image from CERN, running on my Ubuntu as the Host.

…t crash WeightedRandomPlacement::access() had two bugs that made it unsafe to use as a per-space access strategy: 1. selectedIndex was set to the chosen fsid instead of its position in args.selectedfs. The downstream consumer (XrdMgmOfsFile.cc) uses it to INDEX parallel arrays (firewalleps, proxys) sized like selectedfs, matching the contract set by RoundRobinPlacement::access. The guard `best_index <= args.selectedfs.size()` then rejected pretty much every selection as ENOENT, and on the rare match returned an out-of-bounds index. In the fix, we track the loop position instead and use a max() sentinel for the "nothing picked" path. 2. score = hashFid(...) / wt divided by the disk weight with no guard. Disk weight is atomic<uint8_t> default-initialised to 0 (ClusterDataTypes.hh), settable to 0 via `eos sched weight ... 0` (SchedCmd.cc), and validDiskPlct does not filter on weight. In the fix, we skip zero-weight disks before the division. Adds FlatScheduler.WeightedRandomAccessSelectedIndex covering both fixes (position semantics, zero-weight skip, only-zero-weight returns ENOENT) and tightens ASSERT_LE -> ASSERT_LT in TLSingleSiteWeighted. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MGM: fix WeightedRandomPlacement::access selectedIndex and division by zero-weight disk possible#57

MGM: fix WeightedRandomPlacement::access selectedIndex and division by zero-weight disk possible#57
bodoque-01 wants to merge 1 commit into
cern-eos:masterfrom
bodoque-01:fix/mgm-weighted-random-access

bodoque-01 commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bodoque-01 commented May 20, 2026

Bug 1 selectedIndex semantics

Bug 2 division by zero on zero-weight disks

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Bug 1 `selectedIndex` semantics