KVM Host HA code improvements #13088

Merged
sureshanaparti merged 8 commits into apache:4.22 from shapeblue:host-ha-code-improvements
May 8, 2026

Conversation

@sureshanaparti
Contributor

@sureshanaparti sureshanaparti commented Apr 29, 2026

Description

This PR fixes the cancellation of VM HA items while Host HA is enabled and its inspection is in progress, and improves the Host HA code (updated logs, some refactoring and cleanup).

When Host HA inspection is in progress, the KVM investigator returns the host status as Up, which cancels the VM HA items. With this change, the VM HA items are no longer cancelled in that case; they are rescheduled to be retried later.
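
For illustration only, the intended behaviour can be sketched as below. This is a minimal sketch and not the code in this PR: `VmHaWorkDecisionSketch`, `HostStatus`, `Action` and `decide` are hypothetical names, and the sketch assumes the caller already knows whether Host HA is enabled for the host and whether its inspection is still running.

```java
// Illustrative sketch only: all names below are hypothetical and are not
// taken from the CloudStack code base.
public class VmHaWorkDecisionSketch {

    // Simplified host status as reported by an investigator.
    public enum HostStatus { UP, DOWN }

    // What to do with a pending VM HA work item.
    public enum Action { CANCEL, RESCHEDULE, PROCEED }

    public static Action decide(HostStatus reported, boolean hostHaEnabled, boolean inspectionInProgress) {
        if (reported == HostStatus.DOWN) {
            // The host is confirmed down, so the VM HA work can go ahead.
            return Action.PROCEED;
        }
        if (hostHaEnabled && inspectionInProgress) {
            // An "Up" answer while Host HA is still inspecting is not final:
            // keep the work item and try again later instead of cancelling it.
            return Action.RESCHEDULE;
        }
        // The host is reported Up and no inspection is pending: nothing to recover.
        return Action.CANCEL;
    }

    public static void main(String[] args) {
        System.out.println(decide(HostStatus.UP, true, true));
        System.out.println(decide(HostStatus.UP, true, false));
        System.out.println(decide(HostStatus.DOWN, true, false));
    }
}
```

The point of the sketch is that only the combination of "reported Up" and "inspection still in progress" changes from cancelling to rescheduling; the other paths are assumed to be unchanged.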

This addresses #7543, #12922

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • Build/CI
  • Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

@sureshanaparti
Contributor Author

@blueorangutan package

@codecov

codecov Bot commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 11.11111% with 328 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.68%. Comparing base (ffebe8e) to head (646863f).
⚠️ Report is 9 commits behind head on 4.22.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ache/cloudstack/kvm/ha/KVMHostActivityChecker.java | 0.00% | 96 Missing ⚠️ |
| ...java/com/cloud/ha/HighAvailabilityManagerImpl.java | 12.67% | 60 Missing and 2 partials ⚠️ |
| ...n/java/org/apache/cloudstack/ha/HAManagerImpl.java | 0.00% | 36 Missing ⚠️ |
| ...om/cloud/hypervisor/kvm/resource/KVMHAMonitor.java | 0.00% | 29 Missing ⚠️ |
| ...vm/src/main/java/com/cloud/ha/KVMInvestigator.java | 0.00% | 16 Missing ⚠️ |
| ...oud/hypervisor/kvm/storage/LibvirtStoragePool.java | 0.00% | 16 Missing ⚠️ |
| ...java/org/apache/cloudstack/kvm/ha/KVMHAConfig.java | 0.00% | 10 Missing ⚠️ |
| ...oud/hypervisor/kvm/storage/LinstorStoragePool.java | 0.00% | 7 Missing ⚠️ |
| ...ud/hypervisor/kvm/storage/StorPoolStoragePool.java | 0.00% | 5 Missing ⚠️ |
| ...g/apache/cloudstack/ha/task/ActivityCheckTask.java | 0.00% | 5 Missing ⚠️ |
| ... and 23 more | | |
Additional details and impacted files
@@            Coverage Diff            @@
##               4.22   #13088   +/-   ##
=========================================
  Coverage     17.67%   17.68%           
- Complexity    15789    15793    +4     
=========================================
  Files          5922     5922           
  Lines        533094   533119   +25     
  Branches      65210    65201    -9     
=========================================
+ Hits          94246    94259   +13     
- Misses       428208   428216    +8     
- Partials      10640    10644    +4     
Flag Coverage Δ
uitests 3.69% <ø> (ø)
unittests 18.75% <11.11%> (+<0.01%) ⬆️

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17658

@sureshanaparti sureshanaparti changed the title from "Host HA code improvements" to "KVM Host HA code improvements" Apr 29, 2026
@sureshanaparti
Contributor Author

@blueorangutan test

@blueorangutan

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

Contributor

@DaanHoogland DaanHoogland left a comment

lgtm and good cleanup (needs testing though)

@blueorangutan

[SF] Trillian test result (tid-15989)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 50948 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr13088-t15989-kvm-ol8.zip
Smoke tests completed. 149 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

… progress, and some code improvements

- When Host HA inspection is in progress, the investigator returns the host status as Up, which cancels the VM HA items
- Don't cancel the VM HA items; instead, reschedule them to try again later
@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@sureshanaparti sureshanaparti marked this pull request as ready for review May 6, 2026 08:31
@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17729

…agent connection status to determine the Host HA inspection in progress or not, and some code improvements
@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17741

@sureshanaparti
Contributor Author

@blueorangutan test

@blueorangutan

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-16032)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 46549 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr13088-t16032-kvm-ol8.zip
Smoke tests completed. 149 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@weizhouapache weizhouapache added this to the 4.22.1 milestone May 7, 2026
@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17764

@sureshanaparti
Contributor Author

@blueorangutan test

@blueorangutan

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17779

@weizhouapache
Member

@blueorangutan test

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-16044)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 50147 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr13088-t16044-kvm-ol8.zip
Smoke tests completed. 149 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

Member

@kiranchavala kiranchavala left a comment

LGTM

Tested manually with the following global setting value

Image
  1. Create an HA-enabled offering

  2. Deploy a VM with the HA-enabled offering

  3. Configure OOBM on the KVM host with either IPMI or Redfish

  4. Enable Host HA on the KVM host

  5. Trigger a kernel panic on the host

echo c > /proc/sysrq-trigger

  6. Host HA is triggered successfully

  7. VM HA also kicks in

Logs

[root@ref-trl-11669-k-Mol8-kiran-chavala-mgmt1 ~]# cat   /var/log/cloudstack/management/management-server.log.2026-05-06 |grep -i "logid:9b1252df"
2026-05-06 17:29:42,814 INFO  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) Processing work HAWork[8-HA-7-Running-Investigating]
2026-05-06 17:29:42,816 DEBUG [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) RESTART with HA WORK
2026-05-06 17:29:42,819 DEBUG [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) Checking Host HA inspection is in progress or not for the host 1 from HAConfig, HA state is Fenced
2026-05-06 17:29:42,820 DEBUG [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) SimpleInvestigator unable to determine the state of the host.  Moving on.
2026-05-06 17:29:42,820 DEBUG [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) XenServerInvestigator unable to determine the state of the host.  Moving on.
2026-05-06 17:29:42,822 DEBUG [o.a.c.h.HAManagerImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) HA: Agent [Host {"id":1,"name":"ref-trl-11669-k-Mol8-kiran-chavala-kvm1","type":"Routing","uuid":"fe5f72ef-5285-4d8d-91f7-c7a951ff2279"}] is fenced.
2026-05-06 17:29:42,823 DEBUG [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) KVMInvestigator was able to determine host Host {"id":1,"name":"ref-trl-11669-k-Mol8-kiran-chavala-kvm1","type":"Routing","uuid":"fe5f72ef-5285-4d8d-91f7-c7a951ff2279"} is in Down
2026-05-06 17:29:42,823 INFO  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) HA on VM instance {"id":7,"instanceName":"i-2-7-VM","state":"Running","type":"User","uuid":"eeacfde8-6019-477d-8551-952c4a2dc7cf"}
2026-05-06 17:29:42,825 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) Wait time setting on com.cloud.agent.api.CheckVirtualMachineCommand is 20 seconds
2026-05-06 17:29:42,827 DEBUG [c.c.h.CheckOnAgentInvestigator] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) Unable to reach the agent for VM instance {"id":7,"instanceName":"i-2-7-VM","state":"Running","type":"User","uuid":"eeacfde8-6019-477d-8551-952c4a2dc7cf"}: Resource [Host:1] is unreachable: Host 1: Host with specified id is not in the right state: Down
2026-05-06 17:29:42,827 INFO  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) SimpleInvestigator could not find VM instance {"id":7,"instanceName":"i-2-7-VM","state":"Running","type":"User","uuid":"eeacfde8-6019-477d-8551-952c4a2dc7cf"}
2026-05-06 17:29:42,827 INFO  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) XenServerInvestigator could not find VM instance {"id":7,"instanceName":"i-2-7-VM","state":"Running","type":"User","uuid":"eeacfde8-6019-477d-8551-952c4a2dc7cf"}
2026-05-06 17:29:42,829 DEBUG [o.a.c.h.HAManagerImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) HA: Host [Host {"id":1,"name":"ref-trl-11669-k-Mol8-kiran-chavala-kvm1","type":"Routing","uuid":"fe5f72ef-5285-4d8d-91f7-c7a951ff2279"}] is fenced.
2026-05-06 17:29:42,829 INFO  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) KVMInvestigator found VM instance {"id":7,"instanceName":"i-2-7-VM","state":"Running","type":"User","uuid":"eeacfde8-6019-477d-8551-952c4a2dc7cf"} to be alive? false
2026-05-06 17:29:42,829 WARN  [o.a.c.f.j.AsyncJobExecutionContext] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) Job is executed without a context, setup psudo job for the executing thread
2026-05-06 17:29:42,843 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) Sync job-91 execution on object VmWorkJobQueue.7
2026-05-06 17:29:43,691 DEBUG [c.c.v.ClusteredVirtualMachineManagerImpl] (HA-Worker-4:[ctx-094abc4b, work-8, ctx-b87ca687]) (logid:9b1252df) start parameter value of enterHardwareSetup == null during processing of queued job
2026-05-06 17:29:43,698 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (HA-Worker-4:[ctx-094abc4b, work-8, ctx-b87ca687]) (logid:9b1252df) Sync job-92 execution on object VmWorkJobQueue.7
2026-05-06 17:29:48,376 INFO  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) HA is now restarting VM instance {"id":7,"instanceName":"i-2-7-VM","state":"Running","type":"User","uuid":"eeacfde8-6019-477d-8551-952c4a2dc7cf"} on Host {"id":2,"name":"ref-trl-11669-k-Mol8-kiran-chavala-kvm2","type":"Routing","uuid":"c5bbfd66-800f-4634-ae73-46aa4d823868"}
2026-05-06 17:29:48,379 WARN  [c.c.a.AlertManagerImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) alertType=[8] dataCenterId=[1] podId=[1] clusterId=[null] message=[HA starting VM: t5 (i-2-7-VM)].
2026-05-06 17:29:48,390 WARN  [c.c.a.AlertManagerImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) No recipients set in global setting 'alert.email.addresses', skipping sending alert with subject [HA starting VM: t5 (i-2-7-VM)] and content [HA starting VM: t5 (i-2-7-VM)].
2026-0

@sureshanaparti sureshanaparti moved this from Todo to In Progress in Apache CloudStack 4.22.1 May 8, 2026
@sureshanaparti sureshanaparti merged commit 4359198 into apache:4.22 May 8, 2026
24 of 26 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Apache CloudStack 4.22.1 May 8, 2026
@sureshanaparti sureshanaparti deleted the host-ha-code-improvements branch May 8, 2026 14:20
Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

KVM Host HA: host reaches Fenced but VMs remain Running on failed host and HA work is marked Done without restart

5 participants