DAOS-18487 object: control EC rebuild resource consumption#17441

Merged
gnailzenh merged 36 commits into release/2.6 from liang/b2_6_ec_res
Mar 23, 2026

@gnailzenh
Contributor

A degraded EC read will allocate and register an extra buffer to recover data, which may cause ENOMEM in some cases.

Although this workaround does not prevent dynamic buffer allocation and registration, it does provide relatively precise control over the resources consumed by degraded EC reads.
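As a rough illustration of the idea (not the code in this PR), the sketch below shows one way to bound the extra memory used by degraded EC reads: a caller reserves bytes from a budget before allocating and registering a recovery buffer, and returns them afterwards. The names `ec_read_budget`, `ec_budget_reserve`, and `ec_budget_release` are hypothetical and are not DAOS APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical budget for the extra buffers used by degraded EC reads. */
struct ec_read_budget {
    size_t limit;  /* maximum bytes that may be reserved at once */
    size_t in_use; /* bytes currently reserved */
};

/*
 * Reserve 'bytes' before allocating a recovery buffer.
 * Returns false when the reservation would exceed the limit, so the
 * caller can wait or retry instead of running into ENOMEM later.
 */
static bool
ec_budget_reserve(struct ec_read_budget *b, size_t bytes)
{
    if (bytes > b->limit - b->in_use)
        return false;
    b->in_use += bytes;
    return true;
}

/* Return 'bytes' to the budget once the recovery buffer is freed. */
static void
ec_budget_release(struct ec_read_budget *b, size_t bytes)
{
    assert(bytes <= b->in_use);
    b->in_use -= bytes;
}
```

The buffer is still allocated dynamically in this sketch; the budget only caps how much can be outstanding at any one time.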

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

A degraded EC read will allocate and register an extra buffer
to recover data, which may cause ENOMEM in some cases.

Although this workaround does not prevent dynamic buffer allocation
and registration, it does provide relatively precise control over the
resources consumed by degraded EC reads.

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
@gnailzenh gnailzenh requested review from a team as code owners January 24, 2026 03:00
@github-actions

Errors are Unable to load ticket data
https://daosio.atlassian.net/browse/DAOS-18487

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
For data migration, after being woken up, the ULT should try
to wake up another ULT if resources are still available.
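The chained wake-up described in this commit can be sketched as follows. This is a single-threaded model under stated assumptions, not the DAOS implementation: the wait queue is a plain array of demands, and the names `migr_waitq`, `migr_wake_one`, `migr_on_wakeup`, and `migr_release` are hypothetical. Each woken waiter runs `migr_on_wakeup()`, which passes the wake-up on to the next waiter as long as its demand still fits.

```c
#include <stddef.h>

#define MIGR_MAX_WAITERS 8

/* Hypothetical model of a migration wait queue (FIFO order). */
struct migr_waitq {
    size_t avail;                    /* free resource units */
    size_t demand[MIGR_MAX_WAITERS]; /* units each sleeping ULT needs */
    size_t head;                     /* index of the next ULT to wake */
    size_t count;                    /* number of valid entries in demand[] */
    size_t woken;                    /* how many ULTs have been woken so far */
};

/* Wake the ULT at the head of the queue if its demand fits. */
static int
migr_wake_one(struct migr_waitq *q)
{
    if (q->head == q->count || q->demand[q->head] > q->avail)
        return 0;
    q->avail -= q->demand[q->head];
    q->head++;
    q->woken++;
    return 1;
}

/* What a ULT does right after waking: try to wake the next waiter. */
static void
migr_on_wakeup(struct migr_waitq *q)
{
    if (migr_wake_one(q))
        migr_on_wakeup(q); /* the newly woken ULT repeats this step */
}

/* Release units, wake the first waiter, and let it continue the chain. */
static void
migr_release(struct migr_waitq *q, size_t units)
{
    q->avail += units;
    if (migr_wake_one(q))
        migr_on_wakeup(q);
}
```

In this model a single release of 5 units with waiters demanding 2, 3, and 4 units wakes the first two and leaves the third asleep until more is released.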

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
wangshilong
wangshilong previously approved these changes Jan 29, 2026
NiuYawei
NiuYawei previously approved these changes Feb 2, 2026
liuxuezhao
liuxuezhao previously approved these changes Feb 2, 2026
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
@gnailzenh gnailzenh dismissed stale reviews from liuxuezhao, NiuYawei, and wangshilong via 9664eb4 February 3, 2026 12:50
gnailzenh and others added 8 commits February 6, 2026 16:54
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
- Add a resource bucket so that overall resource consumption does not
  grow on systems configured with more targets
- Track demanded resources and a wait queue for blocked ULTs, and wake
  up as many waiters as the released resources allow
- Code cleanup
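The bucket behaviour listed above can be sketched as follows. This is a minimal single-threaded model under stated assumptions, not the code from this commit: one `res_bucket` is shared by all targets, so its limit stays fixed regardless of the target count; blocked requests are kept in a FIFO ring with their total tracked in `demanded`; and a release wakes waiters in order until the next one no longer fits. All names (`res_bucket`, `bucket_acquire`, `bucket_release`, `BKT_MAX_WAITERS`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BKT_MAX_WAITERS 16

/* Hypothetical bucket shared by all targets on an engine. */
struct res_bucket {
    size_t limit;                  /* total units, independent of target count */
    size_t in_use;                 /* units currently granted */
    size_t demanded;               /* units requested by blocked waiters */
    size_t waitq[BKT_MAX_WAITERS]; /* FIFO ring of blocked requests */
    size_t head;                   /* index of the oldest waiter */
    size_t count;                  /* number of blocked waiters */
};

/*
 * Returns 1 if granted immediately, 0 if queued (the caller would block),
 * and -1 if the request can never fit or the queue is full.
 */
static int
bucket_acquire(struct res_bucket *b, size_t units)
{
    if (units > b->limit || b->count == BKT_MAX_WAITERS)
        return -1;
    /* Grant only when nobody is already waiting, to keep FIFO order. */
    if (b->count == 0 && units <= b->limit - b->in_use) {
        b->in_use += units;
        return 1;
    }
    b->waitq[(b->head + b->count) % BKT_MAX_WAITERS] = units;
    b->count++;
    b->demanded += units;
    return 0;
}

/* Release units and wake as many waiters as now fit; returns that count. */
static size_t
bucket_release(struct res_bucket *b, size_t units)
{
    size_t woken = 0;

    assert(units <= b->in_use);
    b->in_use -= units;
    while (b->count > 0 && b->waitq[b->head] <= b->limit - b->in_use) {
        size_t need = b->waitq[b->head];

        b->in_use += need;   /* the woken waiter now holds its units */
        b->demanded -= need;
        b->head = (b->head + 1) % BKT_MAX_WAITERS;
        b->count--;
        woken++;
    }
    return woken;
}
```

Rejecting requests larger than the limit is a simplification of this sketch; how the PR handles oversized requests is not shown here.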

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
increase default resource limit

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
gnailzenh and others added 14 commits February 25, 2026 22:58
Fix a reference leak in migrate_fini_one_ult()

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
- Hulk data handling is no longer required; it is replaced by the
  starveling mechanism
- Remove the "yield" and simplify the code

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
If a rebuild hang is detected, dump resource bucket information and the waiter queue head

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
gnailzenh and others added 6 commits March 12, 2026 21:31
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
- remove private resource
- add hulk data back, but as a separate resource type
- other cleanups

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17441/30/testReport/

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17441/31/testReport/

@gnailzenh gnailzenh merged commit f65a4e4 into release/2.6 Mar 23, 2026
41 of 44 checks passed
@gnailzenh gnailzenh deleted the liang/b2_6_ec_res branch March 23, 2026 12:12
@daltonbohning
Contributor

This PR was merged with lint failures, and those failures will now appear on every 2.6 PR and on landing runs for 2.6.
https://github.com/daos-stack/daos/actions/runs/23380992086/job/68020301407?pr=17441

@daltonbohning
Contributor

I pushed #17763 to fix it.

7 participants