DAOS-18487 object: control EC rebuild resource consumption by gnailzenh · Pull Request #17439 · daos-stack/daos

gnailzenh · 2026-01-24T01:33:09Z

A degraded EC read will allocate and register an extra buffer to recover data, which may cause ENOMEM in some cases.

Although this workaround does not prevent dynamic buffer allocation and registration, it does provide relatively precise control over the resources consumed by degraded EC reads.

Steps for the author:

Commit message follows the guidelines.
Appropriate Features or Test-tag pragmas were used.
Appropriate Functional Test Stages were run.
At least two positive code reviews including at least one code owner from each category referenced in the PR.
Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

Gatekeeper requested (daos-gatekeeper added as a reviewer).

A degraded EC read will allocate and register an extra buffer to recover data, which may cause ENOMEM in some cases. Although this workaround does not prevent dynamic buffer allocation and registration, it does provide relatively precise control over the resources consumed by degraded EC reads.

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

github-actions · 2026-01-24T01:33:25Z

Errors: Unable to load ticket data
https://daosio.atlassian.net/browse/DAOS-18487

liuxuezhao · 2026-01-24T02:40:13Z

src/object/srv_obj_migrate.c

+		 * registration, it does provide relatively precise control over the
+		 * resources consumed by degraded EC reads.
+		 */
+		data_size *= MIN(8, obj_ec_data_tgt_nr(&mrone->mo_oca));


See L2052 below: data_size is passed to migrate_dkey(tls, mrone, data_size).
Could the amplified size be stored in a new variable that is passed only to migrate_res_hold()/release(), so that migrate_dkey() is not affected?

Also, some fetch cases do not need the data recovery process and so will not allocate extra buffers. Maybe we do not need to add so much size, as this may affect RB performance.
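
The idea of keeping the amplified size separate from the size used for migration can be sketched as follows. This is only an illustration: `demo_res_size()` and its `degraded` flag are hypothetical and are not DAOS functions. Only the cap of 8 and the multiplication by the number of EC data targets come from the diff above.

```c
#include <stdint.h>

/* Cap taken from the diff: MIN(8, number of EC data targets). */
#define DEMO_EC_AMPLIFY_MAX 8u

/*
 * Hypothetical helper (not part of DAOS): compute the size to charge
 * against the rebuild resource limit. The caller keeps its original
 * data_size untouched and uses the returned value only for the
 * hold/release accounting.
 */
static uint64_t
demo_res_size(uint64_t data_size, unsigned int ec_data_tgt_nr, int degraded)
{
	unsigned int factor;

	/* Assumption: a non-degraded fetch needs no recovery buffer. */
	if (!degraded)
		return data_size;

	factor = ec_data_tgt_nr < DEMO_EC_AMPLIFY_MAX ? ec_data_tgt_nr
						      : DEMO_EC_AMPLIFY_MAX;
	/* Note: no overflow protection here; real code would need it. */
	return data_size * factor;
}
```

With this shape, a caller would hold and release `demo_res_size(data_size, ...)` units while still passing the unmodified `data_size` to the migration routine.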

daosbuild3 · 2026-01-24T03:54:08Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17439/1/testReport/

daosbuild3 · 2026-01-25T01:30:02Z

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17439/1/execution/node/1282/log

daosbuild3 · 2026-01-25T01:50:05Z

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17439/1/execution/node/1323/log

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

daosbuild3 · 2026-01-27T01:59:53Z

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17439/2/execution/node/1352/log

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

daosbuild3 · 2026-01-28T03:34:03Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17439/3/testReport/

For data migration, after being woken up, a ULT should try to wake up another ULT if resources are still available.

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
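
The chained wake-up described in this commit message might look roughly like the sketch below. It is a single-threaded simulation with hypothetical names (`demo_bucket`, `demo_wake_one`, `demo_release`), not the PR's code: in a real ULT scheduler each woken waiter would itself try to wake the next one, which the loop in `demo_release()` merely emulates.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_WAITERS 8

struct demo_waiter {
	uint64_t units; /* resource units this waiter needs */
	bool     woken;
};

struct demo_bucket {
	uint64_t            limit;
	uint64_t            used;
	struct demo_waiter *queue[DEMO_MAX_WAITERS]; /* FIFO of waiters */
	size_t              head;
	size_t              tail;
};

/* Wake the waiter at the queue head if its request fits (at most one). */
static struct demo_waiter *
demo_wake_one(struct demo_bucket *b)
{
	struct demo_waiter *w;

	if (b->head == b->tail)
		return NULL; /* nobody is waiting */

	w = b->queue[b->head];
	if (w->units > b->limit - b->used)
		return NULL; /* head does not fit yet; keep FIFO order */

	b->used += w->units;
	w->woken = true;
	b->head++;
	return w;
}

/* Release units, then pass the wake-up along; returns the number woken. */
static unsigned int
demo_release(struct demo_bucket *b, uint64_t units)
{
	unsigned int nr = 0;

	b->used -= units;
	/* Emulates each woken waiter waking the next while resources remain. */
	while (demo_wake_one(b) != NULL)
		nr++;
	return nr;
}

/* Small scenario: limit 10, one holder of 10 units, four queued waiters. */
static unsigned int
demo_chain_wakeup(void)
{
	struct demo_waiter w[4] = {{2, false}, {3, false}, {4, false}, {5, false}};
	struct demo_bucket b = {.limit = 10, .used = 10};
	size_t             i;

	for (i = 0; i < 4; i++)
		b.queue[b.tail++] = &w[i];

	return demo_release(&b, 10);
}
```

In this scenario the first three waiters (2 + 3 + 4 units) fit under the limit of 10 and are woken in turn, while the fourth, needing 5 units, keeps waiting.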

daosbuild3 · 2026-01-30T02:49:14Z

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17439/4/execution/node/1392/log

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

daosbuild3 · 2026-02-03T16:01:16Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17439/5/testReport/

daosbuild3 · 2026-02-03T19:03:49Z

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17439/5/execution/node/1306/log

daosbuild3 · 2026-02-03T19:44:10Z

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17439/5/execution/node/1365/log

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

daosbuild3 · 2026-03-11T17:00:34Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17439/21/testReport/

daosbuild3 · 2026-03-12T03:00:11Z

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17439/21/execution/node/1358/log

- hulk data handling is no longer required; it is replaced by the starveling mechanism
- remove the "yield" and simplify code

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

daosbuild3 · 2026-03-12T04:09:20Z

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17439/22/execution/node/304/log

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

daosbuild3 · 2026-03-12T04:53:43Z

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17439/23/execution/node/304/log

If a rebuild hang is detected, dump resource bucket information and the waiter queue head.

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>

gnailzenh · 2026-03-12T06:24:58Z

src/object/srv_obj_migrate.c

+		return 0;
+	}
+
+	if (!migr_res_is_private(res))


You can just take the mutex now; it is very cheap on a private resource. I've made other parts of the PR consistent.

gnailzenh · 2026-03-12T06:26:42Z

src/object/srv_obj_migrate.c

+	if (!migr_res_is_private(res))
+		ABT_mutex_unlock(res->res_mutex);
+
+	snprintf(buf + off, bufsz - off, " used/limit=%ld/%ld h=%d w=%d", used, limit, holders,


Should we add the resource name as well?

The resource name is printed in the caller.
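
A minimal sketch of the dump pattern discussed in this thread follows: take the lock unconditionally, snapshot the counters, then format them with a bounds-checked `snprintf`. It uses a POSIX `pthread_mutex_t` as a stand-in for the Argobots `ABT_mutex` in the diff, and `struct demo_res` and `demo_res_dump()` are hypothetical names, not the PR's code. Only the format string is taken from the diff.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical resource bucket; field names are illustrative only. */
struct demo_res {
	pthread_mutex_t lock;
	long            used;
	long            limit;
	int             holders;
	int             waiters;
};

/*
 * Append a one-line summary at buf + off and return the new offset.
 * The offset never moves past the terminating NUL, so callers can chain
 * several dumps into the same buffer.
 */
static size_t
demo_res_dump(struct demo_res *res, char *buf, size_t bufsz, size_t off)
{
	long used, limit;
	int  holders, waiters, n;

	if (off >= bufsz)
		return off; /* no room left */

	/* Snapshot under the lock, format outside of it. */
	pthread_mutex_lock(&res->lock);
	used    = res->used;
	limit   = res->limit;
	holders = res->holders;
	waiters = res->waiters;
	pthread_mutex_unlock(&res->lock);

	n = snprintf(buf + off, bufsz - off, " used/limit=%ld/%ld h=%d w=%d",
		     used, limit, holders, waiters);
	if (n < 0)
		return off; /* formatting error; leave the offset alone */
	if ((size_t)n >= bufsz - off)
		return bufsz - 1; /* output was truncated */
	return off + (size_t)n;
}
```

As in the thread above, the caller is expected to write the resource name into the buffer first and pass the resulting offset in `off`.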

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

daosbuild3 · 2026-03-12T13:27:02Z

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17439/25/execution/node/1281/log

- remove private resource
- add hulk data back, but as a separate resource type
- other cleanups

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>

gnailzenh · 2026-03-20T15:39:19Z

@wangshilong @NiuYawei @liuxuezhao @kccain this PR is ready for review

daltonbohning · 2026-03-23T16:25:57Z

This PR was merged with lint failures and now that failure is going to appear on every master PR and landing runs for master
https://github.com/daos-stack/daos/actions/runs/23185585865/job/67368150733?pr=17439

daltonbohning · 2026-03-23T18:42:48Z

This PR was merged with lint failures and now that failure is going to appear on every master PR and landing runs for master https://github.com/daos-stack/daos/actions/runs/23185585865/job/67368150733?pr=17439

I pushed #17762 to fix

liuxuezhao · 2026-03-12T08:58:55Z

src/object/srv_internal.h

 	/* migration init error */
 	int			mpt_init_err;
+
+	/* Watchdog: track progress to detect complete rebuild hang */


Minor: in "detect complete rebuild hang", can "complete" be removed?
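
For context, a progress-tracking watchdog of the kind the code comment describes could be sketched as below. This is an assumption about the general technique, not the PR's implementation: `struct demo_watchdog` and `demo_watchdog_check()` are hypothetical, and the timeout and time units are arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical watchdog state; not the fields added by the PR. */
struct demo_watchdog {
	uint64_t last_progress;  /* progress counter seen at the last check */
	uint64_t last_change_ts; /* time at which the counter last changed */
	uint64_t timeout;        /* no-progress interval that counts as a hang */
};

/*
 * Called periodically with the current progress counter and time.
 * Returns true when no progress has been observed for at least `timeout`.
 */
static bool
demo_watchdog_check(struct demo_watchdog *wd, uint64_t progress, uint64_t now)
{
	if (progress != wd->last_progress) {
		/* Progress was made: remember it and restart the timer. */
		wd->last_progress  = progress;
		wd->last_change_ts = now;
		return false;
	}
	return now - wd->last_change_ts >= wd->timeout;
}
```

When the check returns true, a caller could dump diagnostic state such as the resource buckets and the head of the waiter queue, as the watchdog commit in this PR describes.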

liuxuezhao · 2026-03-23T07:16:35Z

src/object/srv_obj_migrate.c

-	MIGR_KEY,
-	MIGR_DATA,
-	MIGR_MAX,
+	MIGR_HULK_INF_MIN = 0, /* disable hulk data */


It looks like MIGR_HULK_INF_MIN is an INVALID setting, but it can still be set.

liuxuezhao · 2026-03-23T15:02:53Z

src/object/srv_obj_migrate.c

+		if (units == -1ULL) {
+			D_ASSERT(!uuid_is_null(pool_id));
+			if (uuid_compare(pool_id, waiter->rw_tls->mpt_pool_uuid) != 0)
+				continue;


It seems possible that no pool_uuid matches at all. In that case, shouldn't L2049 ~ 2051 below be skipped?
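
The concern can be illustrated with a small sketch in which the scan reports how many waiters actually matched, so the caller can skip any follow-up work when the count is zero. Everything here is hypothetical: `demo_wake_pool()` is not the PR's function, `memcmp` on a 16-byte array stands in for `uuid_compare()`, and what the real code does at L2049 ~ 2051 is not shown in this conversation.

```c
#include <stddef.h>
#include <string.h>

typedef unsigned char demo_uuid_t[16];

/* Hypothetical waiter record; only the pool UUID matters here. */
struct demo_waiter_rec {
	demo_uuid_t pool;
	int         woken;
};

/*
 * Wake every waiter that belongs to pool_id and return how many matched.
 * A return value of 0 tells the caller that nothing was woken.
 */
static size_t
demo_wake_pool(struct demo_waiter_rec *ws, size_t nr, const demo_uuid_t pool_id)
{
	size_t i;
	size_t matched = 0;

	for (i = 0; i < nr; i++) {
		if (memcmp(pool_id, ws[i].pool, sizeof(demo_uuid_t)) != 0)
			continue; /* waiter belongs to another pool */
		ws[i].woken = 1;
		matched++;
	}
	return matched;
}
```

A caller would then guard the post-processing with a check such as `if (matched > 0)`.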

gnailzenh requested review from a team as code owners January 24, 2026 01:33

gnailzenh requested review from liuxuezhao and wangshilong January 24, 2026 01:33

liuxuezhao reviewed Jan 24, 2026


gnailzenh added 2 commits January 26, 2026 20:52

DAOS-18487 object: degraded buffer size only impact resource control

fc7efdc

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

Merge branch 'master' into b_ec_res

eecf6d3

wangshilong previously approved these changes Jan 26, 2026


liuxuezhao previously approved these changes Jan 27, 2026


DAOS-18487 object: amplify credits also for data from parity shard

cf0d064

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

gnailzenh dismissed stale reviews from liuxuezhao and wangshilong via cf0d064 January 28, 2026 02:41

DAOS-18487 object: try to wake up more ULTs

b000ff0

For data migration, after being woken up, a ULT should try to wake up another ULT if resources are still available.

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

gnailzenh requested a review from NiuYawei January 29, 2026 08:26

NiuYawei previously approved these changes Jan 29, 2026


liuxuezhao previously approved these changes Jan 29, 2026


DAOS-18487 object: decrease upper limit of rebuild resource

9086fc3

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

gnailzenh dismissed stale reviews from liuxuezhao and NiuYawei via 9086fc3 February 3, 2026 13:12

wangshilong dismissed their stale review via d0f8632 March 11, 2026 09:15

DAOS-18487 rebuild: fix a bug for starveling

a6a5073

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

DAOS-18487 rebuild: code cleanup

e76123c

- hulk data handling is no longer required; it is replaced by the starveling mechanism
- remove the "yield" and simplify code

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

DAOS-18487 rebuild: add a few assertions

047b71c

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

Add watchdog to detect rebuild hang

42bd6ed

If a rebuild hang is detected, dump resource bucket information and the waiter queue head.

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>

gnailzenh commented Mar 12, 2026


wangshilong and others added 2 commits March 12, 2026 14:39

codes cleanup

880aaae

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>

DAOS-18487 rebuild: integer overflow

3bcd524

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

gnailzenh and others added 5 commits March 16, 2026 20:31

DAOS-18487 rebuild: code cleanup

7727cdb

- remove private resource
- add hulk data back, but as a separate resource type
- other cleanups

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

DAOS-18487 rebuild: remove the false assertion

1b61456

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

Fix to reset eventual if reused

f05a7ad

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>

improve watchdog

b3c3aad

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>

Merge branch 'master' of github.com:daos-stack/daos into liang/b_ec_res

aa0c0dc

gnailzenh requested a review from kccain March 17, 2026 14:09

wangshilong approved these changes Mar 21, 2026


NiuYawei approved these changes Mar 23, 2026


gnailzenh merged commit e2dab9f into master Mar 23, 2026
39 of 41 checks passed

gnailzenh deleted the liang/b_ec_res branch March 23, 2026 12:11

liuxuezhao reviewed Mar 24, 2026


liuxuezhao mentioned this pull request Mar 24, 2026

DAOS-18487 object: hulk data only consumes 1 unit #17768

Open

6 tasks

6 participants
