DAOS-18487 object: control EC rebuild resource consumption#17441
gnailzenh merged 36 commits into release/2.6
A degraded EC read allocates and registers an extra buffer to recover data, which may cause ENOMEM in some cases. Although this workaround does not prevent dynamic buffer allocation and registration, it does provide relatively precise control over the resources consumed by degraded EC reads.
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17441/2/testReport/
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
For data migration, after being woken up, a ULT should try to wake up another ULT if resources are still available.
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17441/4/testReport/
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
9664eb4
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17441/5/testReport/
Test stage Functional Hardware Large completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17441/5/execution/node/1541/log
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
- Add a resource bucket so that overall resource consumption does not grow on systems configured with more targets
- Track demanded resources and a wait queue for blocked ULTs, and wake up as many waiters as the released resources allow
- Code cleanup
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
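The bucket-and-wait-queue idea in this commit message can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the names `res_bucket`, `waiter`, `bucket_acquire`, and `bucket_release` are hypothetical, a blocked ULT is modeled by a `granted` flag instead of real Argobots scheduling, and clamping an oversized demand to the bucket limit is an assumption made so that every request can eventually proceed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WAITERS 16 /* hypothetical fixed queue depth for the sketch */

/* One blocked requester; "granted" stands in for waking the ULT. */
struct waiter {
	uint64_t demand;
	bool     granted;
};

/* Shared bucket: a global limit plus a FIFO ring of blocked requesters. */
struct res_bucket {
	uint64_t       limit;    /* total resource units available */
	uint64_t       used;     /* units currently held */
	uint64_t       demanded; /* units requested by queued waiters */
	struct waiter *waitq[MAX_WAITERS];
	size_t         head;
	size_t         count;
};

static void bucket_init(struct res_bucket *b, uint64_t limit)
{
	b->limit = limit;
	b->used = 0;
	b->demanded = 0;
	b->head = 0;
	b->count = 0;
}

/* Returns 1 if granted at once, 0 if queued, -1 if the queue is full. */
static int bucket_acquire(struct res_bucket *b, struct waiter *w,
			  uint64_t demand)
{
	if (demand > b->limit) /* assumption: clamp oversized demands */
		demand = b->limit;
	w->demand = demand;
	w->granted = false;

	/* Grant immediately only when nobody is already waiting (FIFO). */
	if (b->count == 0 && b->used + demand <= b->limit) {
		b->used += demand;
		w->granted = true;
		return 1;
	}
	if (b->count == MAX_WAITERS)
		return -1;
	b->waitq[(b->head + b->count) % MAX_WAITERS] = w;
	b->count++;
	b->demanded += demand;
	return 0;
}

/* Release units, then wake as many queued waiters as now fit, in order. */
static size_t bucket_release(struct res_bucket *b, uint64_t amount)
{
	size_t woken = 0;

	assert(amount <= b->used);
	b->used -= amount;
	while (b->count > 0) {
		struct waiter *w = b->waitq[b->head];

		if (b->used + w->demand > b->limit)
			break; /* head does not fit: stop to keep FIFO order */
		b->used += w->demand;
		b->demanded -= w->demand;
		w->granted = true;
		b->head = (b->head + 1) % MAX_WAITERS;
		b->count--;
		woken++;
	}
	return woken;
}
```

Stopping at the first waiter that does not fit is one possible policy: it keeps ordering strict so a large request is not starved by a stream of small ones, at the cost of leaving some capacity idle.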
increase default resource limit Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Fix a reference leak in migrate_fini_one_ult() Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
- Hulk data handling is no longer required; it is replaced by the starveling mechanism
- Remove the "yield" and simplify the code
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
If a rebuild hang is detected, dump the resource bucket information and the head of the waiter queue.
Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17441/22/testReport/
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
- Remove private resource
- Add hulk data back, but as a separate resource type
- Other cleanups
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
Test stage Functional Hardware Medium Verbs Provider completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17441/30/testReport/
Test stage Functional Hardware Medium Verbs Provider completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17441/31/testReport/
This PR was merged with lint failures, and that failure will now appear on every 2.6 PR and on landing runs for 2.6.
I pushed #17763 to fix it.