Skip to content

[ZCF-2747] strip SDN controller inventory endpoint port#4029

Open
ZStack-Robot wants to merge 117 commits into
feature-5.5.22-zcf-temporaryfrom
sync/shixin.ruan/shixin-ZCF-2747@@2
Open

[ZCF-2747] strip SDN controller inventory endpoint port#4029
ZStack-Robot wants to merge 117 commits into
feature-5.5.22-zcf-temporaryfrom
sync/shixin.ruan/shixin-ZCF-2747@@2

Conversation

@ZStack-Robot
Copy link
Copy Markdown
Collaborator

ZCF-2747\n\n- Return SDN controller inventory ip as host address when DB stores host:port endpoint.\n- Keeps persisted controller endpoint unchanged for backend use.\n\nVerification:\n- git diff --check\n- mvn -Ppremium -pl :plugin-premium,:test-premium,:header -am -DfailIfNoTests=false -Dtest=TestZnsVmNicLifeCycleCase,TestZnsControllerLifeCycleCase test: reactor compile passed; TestZnsControllerLifeCycleCase passed; TestZnsVmNicLifeCycleCase blocked by local test environment startup/JMX after prior fork.

sync from gitlab !9925

shan.wu and others added 30 commits April 7, 2026 15:56
…emetal2 instance

support attaching novlan and vxlan network to baremetal2 instance

Resolves/Related: ZSTAC-82781

Change-Id: I736d637a7168656a6c726c6769777a726e616974

(cherry picked from commit 2499826)
Add virtiofs-based model mounting capability for user VMs:
- Add VmModelMountVO database table and schema with hostUuid tracking
- Add AttachVirtiofsCmd/DetachVirtiofsCmd for KVM agent
- Add MountModelCenterCmd with storageUrl field
- Add SDK actions for mount/unmount/query operations
- Add error codes 10138-10149

Resolves: ZSTAC-83157
Change-Id: I746679736f7a7176646e646d797969766f697a76
Resolves: ZSTAC-83157

Change-Id: I62696a6d667468766a6575656763707374757277
- add dedicated container image pull error code mapping

Resolves: ZSTAC-84175

Change-Id: Iabcdef1234567890abcdef1234567890abcdef12
Resolves: ZSTAC-83157

Change-Id: I756c7073707a6468676c70696a7a746870767777
<fix>[conf]: update global-error

See merge request zstackio/zstack!9557
Revert the flat userdata hostname fallback so VM creation and
reboot do not inject derived hostname metadata into userdata
payloads when no hostname is configured.

Add CreateRebootVmKeepUserdataContentCase to capture flat
userdata apply requests during VM create and reboot, and assert
the userdata list remains unchanged while hostname metadata stays
absent.

Resolves: ZSTAC-84529

Change-Id: I6f6e6b6b796e6e7a73727673626b7569696a6172
Expose requestCpu/requestMemory on the generated PodInventory so
SDK callers can deserialize the fields returned by ZQL.

Upgrade script adds matching columns on PodVO and backfills
existing pods with request=limit for legacy-equivalent behaviour.

Resolves: ZSTAC-80103

Change-Id: I5b2a9d4e8c6f3b7a1d0e9c5b4a8f2d7e6c1b3a9f
<feature>[dpu-bm2]: support attaching novlan and vxlan network to baremetal2 instance

See merge request zstackio/zstack!9530
fix(network): keep userdata content immutable

See merge request zstackio/zstack!9686
… redeploy

Resolves: ZSTAC-84446

Change-Id: I8af5e3887a5bad286b43dda00c874c9de999e1cb
<fix>[kvm]: unify TLS cert IP collection to avoid reconnect-triggered redeploy

See merge request zstackio/zstack!9683
Resolves: ZSTAC-84751

Change-Id: I84751b1c2d3e4f5a678901234567890abcdef1235
<fix>[aios]: add i18n mappings for fatal image pull error

See merge request zstackio/zstack!9670
<fix>[sdk,db]: add PodInventory request fields

See merge request zstackio/zstack!9682
…0165

Resolves: ZSTAC-84677
Related: ZSTAC-80103

Two overlapping problems are addressed in the ai module's global error
codes; both root from the same author reusing codes that were already
allocated for other semantics.

(1) ZSTAC-84677: AIModelManagerImpl model-service startup-timeout path
    passed the string literal "ORG_ZSTACK_AI_10144" to operr(). That
    code is owned by VmModelMountManager (mount path conflict, 5 %s
    template); the timeout only supplies 3 formatArgs, so the UI
    rendered a garbled "挂载失败:路径..." message for a health-check
    timeout.

(2) ZSTAC-80103: error codes 10138-10143 had two conflicting owners:
    - AIModelManagerApiInterceptor: CPU / memory request validation
      (this is the canonical semantics per the AI error-code spec)
    - VmModelMountManager: VM/model not-found, VM-state checks,
      cross-account mount, ModelCenter not-found
    The mount feature wrote zh_CN i18n templates with mount semantics,
    so when the CPU/memory validation actually fired, users saw the
    unrelated mount text with %s slots filled by CPU/memory args
    (e.g. "虚拟机「1」(UUID: 4) 必须处于运行状态才能挂载模型,当前状态: 4"
    for a CPU-below-min error).

This commit (zstack side):
- Add 7 new error-code constants in CloudOperationsErrorCode:
  10159 (model service startup timeout)
  10160-10165 (relocated mount-feature errors that previously squatted
  on 10138-10143)
- Rewrite globalErrorCodeMapping entries for 10138-10143 to the
  canonical CPU/memory validation semantics
- Add localized templates for 10159-10165 across all 10 locales
  (zh_CN, en_US, zh_TW, ja-JP, ko-KR, de-DE, fr-FR, ru-RU, id-ID,
  th-TH) so non-Chinese users see localized messages

The companion premium MR switches:
- AIModelManagerImpl startup-timeout call site to 10159
- VmModelMountManager 6 call sites from 10138-10143 to 10160-10165

Change-Id: I076f8da0b935a47cd4956e039e1e0b05206d085e
<fix>[sdk]: sync OSPF neighbor adjacency fields

See merge request zstackio/zstack!9737
Resolves: ZSTAC-84446

Change-Id: I9bed31c0cefddd6ed11f59cd13e36eb1c2abc029
<fix>[conf]: add ORG_ZSTACK_AI_10159 for model service startup timeout

See merge request zstackio/zstack!9734
<fix>[kvm]: decouple TLS cert detection from libvirtd restart toggle

See merge request zstackio/zstack!9741
PortForwardingConfigProxy.detachNetworkServiceFromNoHaVirtualRouter
would generate 'AND vo.uuid IN ()' when called with an empty list,
causing a MariaDB syntax error. Add an early return guard.

Resolves: ZSTAC-84748

Change-Id: I15a86ffb92829af6952697a4f3cb1351cd5ed28c
<fix>[portforwarding]: skip detach when serviceUuids is empty

See merge request zstackio/zstack!9744
… marshal time

Per ZSTAC-84817, malformed JSON in APIAddExternalPrimaryStorage.config slipped
past the API boundary and surfaced as SYS.1000 InternalError from a deep Gson
parse, after the message had already reached the cloud-bus handler.

Fix:

* PrimaryStorageApiInterceptor now validates msg.getConfig() with org.json
  (strict — rejects unquoted keys / missing colons that Gson lenient mode
  silently tolerates) and throws ApiMessageInterceptionException with
  ORG_ZSTACK_STORAGE_PRIMARY_10053 if it isn't a valid JSON object.
* conf/serviceConfig/externalPrimaryStorage.xml wires
  PrimaryStorageApiInterceptor at service level so the interceptor actually
  runs for APIAddExternalPrimaryStorageMsg (the existing wiring in
  primaryStorage.xml does not cover messages declared in this file).
* CloudOperationsErrorCode.ORG_ZSTACK_STORAGE_PRIMARY_10053 constant added.

Test: ZbsPrimaryStorageCase gains
testAddExternalPrimaryStorageWithMalformedJsonRejectedByInterceptor — sends
"{this is not valid json" via API, asserts the call fails AND no
PrimaryStorageVO/ExternalPrimaryStorageVO/PrimaryStorageOutputProtocolRefVO
row leaks. Verified locally: Tests run: 1, Failures: 0, Errors: 0,
Time elapsed: 92.788 sec, BUILD SUCCESS.

Resolves: ZSTAC-84817

Change-Id: Icf6c648133d7866edf35940d56a28f74f4c64817
Resolves: ZSTAC-84839

Change-Id: I7a756e72627a6c67736a72656669677667747067
<fix>[primaryStorage]: validate ExternalPrimaryStorage config JSON at marshal time

See merge request zstackio/zstack!9743
<fix>[allocator]: thread HostAllocationPurpose through allocator path

See merge request zstackio/zstack!9754
br_conn_all_ns (169.254.64.1, mevoco.py CONNECT_ALL_NETNS_BR_OUTER_IP) is
created on the host only when the first flat L3 network lands, AFTER the
add-host ansible playbook has generated the TLS cert. The cert SAN
therefore lacks this IP, and on every subsequent reconnect the
check-tls-certs flow detects the gap and triggers an ansible redeploy
plus libvirtd/kvmagent restart, breaking PID stability.

This IP is host-internal (used only for VM <-> root-netns userdata /
pushgateway / kvmagent HTTP) and has no business in a libvirtd TLS cert.
Add EXCLUDED_INTERNAL_IPS as a single source of truth used by both the
check flow (buildIpList) and the deploy flow (unionTlsCertIps) so the two
flows can never disagree, regardless of when the bridge appears.
host_plugin.fact() on the agent side is unchanged.

Resolves: ZSTAC-84446

Change-Id: I10821d8c68190f2cbc8a0679d19ed053916d9184
<fix>[kvm]: exclude br_conn_all_ns IP from TLS cert SAN

See merge request zstackio/zstack!9758
DBImpact

Change-Id: I6d757a76676d6763697161776a61796175737775
…erviceRefVO.isDefault/createDate/lastOpDate

Resolves: ZSTAC-84025

Change-Id: Ifbb68c9d475aa19762bb053e6545381a305dfe5b
shixin.ruan and others added 22 commits May 18, 2026 16:57
Resolves: ZCF-3703

Change-Id: I8b002e1c0b7e0535df0803a8107e9b4ebbdadf21
<fix>[vm]: ZSTAC-84158 handle StopVmGC host mismatch

See merge request zstackio/zstack!9888
KVM host reconnect can still fail with `Defaults requiretty` because
the Java control plane runs some sudo commands through direct SSH,
outside the Ansible path. Use a per-call PTY only for those commands
so existing `SshShell.runCommand` output behavior stays unchanged.

Resolves: ZSTAC-77375

Change-Id: I69756c636d6b65636c69616a7076756869646a63
<fix>[flatNetworkProvider]: refresh metadata after attaching non-default nic

See merge request zstackio/zstack!9905
<fix>[query]: ZSTAC-80742 limit API query concurrency

See merge request zstackio/zstack!9866
When kvmagent crashes/restarts, MN-side in-flight HTTP wrappers wait until the 5min default timeout before failing, blocking ChainTask queues (e.g. volume backup stuck). Add a drain trigger on the connect reply: when agent reports firstConnect=true, all pending HTTP calls scoped to that host with sentAtMillis < agentStartTime - 60s are failed immediately.

- RESTFacade.failPendingCallsForResourceBefore(resourceUuid, cutoffMillis, err): single-pass scan over wrappers, filters by X-Resource-UUID header + send time; calls wrapper.fail(err) (same path as natural timeout, idempotent via called.compareAndSet).
- AsyncHttpWrapper exposes getResourceUuid() + getSentAtMillis(); closure stores both at construction.
- KVMAgentCommands.ConnectResponse adds firstConnect + agentStartTimeMillis. PingResponse unchanged.
- KVMHost.connectToAgent drains stale calls inline when rsp.isFirstConnect() after a successful connect.
- New i18n error code ORG_ZSTACK_CORE_REST_10013 (en/zh).
- Groovy IT HostAgentRestartDrainCase drives the path via reconnectHost SDK action with KVM_CONNECT_PATH simulator override.

Resolves: ZSTAC-58310

Change-Id: I1c2d5e3f7a8b9c0d2e4f5a6b7c8d9e0f1a2b3c4d
Split ZNS DHCP error codes used by premium DHCP paths.\nAdd i18n mappings for the new concrete error templates.

Resolves: ZCF-3703

Change-Id: I11bf0fc6df8cf5a5e1ae24af958c5ca3712b4347
<fix>[kvm]: drain stale http calls on kvmagent restart

See merge request zstackio/zstack!9904
Move premium-only HA pre-fence details out of community while keeping the generic async before-start extension point support.

Jira: ZSTAC-83890

Test: mvn -pl header,compute,plugin/kvm,simulator/simulatorImpl,testlib -DskipTests compile

Change-Id: I168adf82338f9df9e76287619b7f76a8e5be695f
Resolves: ZSTAC-84691

Change-Id: I6668736975707575756d7665647867686e6b6b76
Pass VM API system tags into AllocateHostMsg so host allocator filters can honor znsNicMode during VM placement.

Resolves: ZCF-3728

Change-Id: I4fbb522440931aad005ceb82a37a29236bd35b27
<fix>[kvm]: allocate tty for selected KVM sudo commands

See merge request zstackio/zstack!9909
JIRA: ZCF-3703

Remove the duplicate systemTags field from SdnControllerEnableDHCPMsg.
NeedReplyMessage already owns systemTags, so redeclaring it makes Gson
fail while CloudBus logs the message before sending it to the ZNS DHCP
handler.

Change-Id: I4546348c9fd45004e0f9b9c823f9cccec9ad4774
APIImpact

Resolves: ZSTAC-84014

Change-Id: I6a7679766a6d666f6966787a757772706f77716a
Persist the resolved NIC driver type during AttachVmNicToVm and narrow prepareDbInitialValue to repair only null VNIC driverType records while respecting the VM virtio tag and platform, preventing existing virtio NICs from being overwritten as e1000 after management node restart.

Resolves: ZSTAC-85160

Change-Id: I6868776c140529e5286d40c187fde99608127f7a
<fix>[kvm]: extend reconnect echo timeout after libvirtd restart

See merge request zstackio/zstack!9883
<fix>[volume]: ReInitDisk support data Volume

See merge request zstackio/zstack!9908
Avoid bad host allocator lambda bytecode.
Set KVM capacity sync timeout explicitly.

Test: VmwareTest passed on root@172.20.13.240.

Resolves: ZCF-3728

Change-Id: I5db26740b9fef0c0ba1b31096a8fdc3659db412b
Avoid NPE when MtuGetter is constructed outside Spring.

Resolves: ZCF-3703

Change-Id: I21a88fca05ab157a767d569435c2d37a48d4179b
<fix>[zns]: ZCF-3728 propagate vm system tags

See merge request zstackio/zstack!9911
<fix>[ha]: stabilize suspect host pre-fence flow

See merge request zstackio/zstack!9885
<fix>[network]: ZCF-3703 add ZNS attach skip hook

See merge request zstackio/zstack!9900
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

Review Change Stack

Walkthrough

此 PR 是一个跨子系统的大型提交:包含 SDN 控制器 inventory 与工厂重构、VM 网卡分配/生命周期扩展、KVM 主机/agent 行为改动、主存数据卷 reinit 流与消息、REST/webhook 客户端与协议、负载均衡健康检查扩展、网络扩展点与数据库迁移、文档与大量测试覆盖。

Changes

Monolithic cross-cutting changes

Layer / File(s) Summary
Cross-cutting feature set
header/..., compute/..., plugin/..., storage/..., core/..., network/..., conf/db/..., docs/..., test/...
SDN inventory normalization & annotations; new SDN status events and abstract factory; AfterAllocateSdnNic and VM NIC lifecycle manager; many VmAllocate*/VmDetach*/VmReturnReleaseNic flows; KVM host/agent and KVMHostUtils updates (pseudo-tty, TLS IP union, libvirt echo timeout); ReInitDataVolume message/flows and primary storage handlers; RESTFacade pending-call failure injection and WebhookCallbackClient/Protocol; LoadBalancer health-check none handling; schema migrations, docs and tests.

Estimated code review effort
🎯 5 (Critical) | ⏱️ ~120 minutes

小兔子来巡山,代码林中跃欢颜,
IP 斩去冒号尾,流程扩展又添盘,
KVM 与存储舞,Webhook 护桥环,
文档测试共登场,合并请细心检看。 🐇✨

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/shixin.ruan/shixin-ZCF-2747@@2

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@header/src/main/java/org/zstack/header/network/sdncontroller/SdnControllerInventory.java`:
- Around line 62-75: The toHostAddress method currently returns the original
endpoint for many cases (losing the trimmed normalization) and doesn't handle
bracketed IPv6 addresses like [fe80::1]:8080; update to always operate on the
trimmed value, add a branch that detects a bracketed IPv6 host (startsWith('[')
and contains ']' followed by ':' ) and strip the trailing :port after the
closing ']', and keep the existing single-colon host:port logic
(firstColon==lastColon) to remove the port for IPv4/hostname cases; ensure the
method returns the normalized host string in all branches using the trimmed
variable.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 28096781-e83b-492a-b998-27b35709c2f3

📥 Commits

Reviewing files that changed from the base of the PR and between f41c0a5 and 4830d88.

📒 Files selected for processing (1)
  • header/src/main/java/org/zstack/header/network/sdncontroller/SdnControllerInventory.java

@MatheMatrix MatheMatrix force-pushed the sync/shixin.ruan/shixin-ZCF-2747@@2 branch 2 times, most recently from 8d3fb7c to a626ed9 Compare May 20, 2026 09:40
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
docs/modules/network/pages/networkResource/ZnsIntegration.adoc (1)

1-976: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

文档新增内容为中文,违反仓库统一英文规范。

该文件当前几乎全部为中文描述,建议改为英文(标题、正文、注释示例说明等)以满足仓库规范。

As per coding guidelines: **/*.*: 代码里不应当有有中文,包括报错、注释等都应当使用正确的、无拼写错误的英文来写

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/modules/network/pages/networkResource/ZnsIntegration.adoc` around lines
1 - 976, The document is written in Chinese and must be converted to English to
comply with repository language rules; please translate all user-facing text
(titles like "= 对接ZNS SDN控制器", section headings, paragraphs, NOTE blocks,
tables, examples, and inline code comments) into clear, idiomatic English and
preserve all technical identifiers and symbols (e.g. Cms struct, Segment,
ZnsControllerVO, SdnControllerVO, ZnsTransportZoneVO, L2NetworkVO,
APICreateL2NetworkMsg, ZnsApiInterceptor, X-Api-Versions, X-Api-Version,
endpoints such as POST /zns/notify/add-compute-manager, OPTIONS
/zns/api/v1/segments, and other API paths) as-is; ensure code blocks and JSON
examples remain syntactically unchanged except for translated surrounding
explanatory text, keep NOTE/NOTE-like admonitions translated, and maintain any
RFC/format examples (UUID formats, headers, tables) exactly while replacing
Chinese descriptions with English equivalents.
plugin/flatNetworkProvider/src/main/java/org/zstack/network/service/flat/FlatDhcpBackend.java (1)

997-1005: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

先读已有 tag,再做 capability gating。

现在一进来就按 isAllocateDhcpServerIp(l3Uuid, ipVersion) 早退,意味着历史上已经存在的 DHCP server IP tag 也拿不到了。升级后如果 provider 对 IPv6 改为 isAllocateDhcpv6ServerIp() == false,这里会把旧的 DHCPv6 地址当成不存在,直接改掉既有行为。更稳妥的顺序是:先返回已有 tag,只有在“当前没有 tag、需要新分配”时才按 capability 返回 null

建议修改
     `@Deferred`
     private String allocateDhcpIp(String l3Uuid, int ipVersion, boolean allocate_ip, String requiredIp, String excludedIp) {
-        if (!isAllocateDhcpServerIp(l3Uuid, ipVersion)) {
-            return null;
-        }
-
         Map<String, String> dhcpServerMap = getExistingDhcpServerIp(l3Uuid, ipVersion);
         if (!dhcpServerMap.isEmpty()) {
             return dhcpServerMap.keySet().iterator().next();
         }
+
+        if (!isAllocateDhcpServerIp(l3Uuid, ipVersion)) {
+            return null;
+        }

As per coding guidelines "向后兼容原则:之前的代码产生的行为不要直接去改动,即使是 bug 也可能用户有依赖,不应直接改动。"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugin/flatNetworkProvider/src/main/java/org/zstack/network/service/flat/FlatDhcpBackend.java`
around lines 997 - 1005, The current allocateDhcpIp method calls
isAllocateDhcpServerIp(l3Uuid, ipVersion) first which prevents returning
historically tagged DHCP server IPs; change the logic to first call
getExistingDhcpServerIp(l3Uuid, ipVersion) and if that map is non-empty return
its key immediately, and only when there is no existing tag consult
isAllocateDhcpServerIp(l3Uuid, ipVersion) and return null if capability denies
allocation; keep the rest of allocateDhcpIp parameters/flow (allocate_ip,
requiredIp, excludedIp) unchanged and ensure this preserves backward
compatibility.
search/src/main/java/org/zstack/query/QueryFacadeImpl.java (1)

352-361: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

并发化后这里会踩到懒加载缓存的竞态。

syncSubmit("api-query")doQuery() 可以并发执行,但 autoQueryMapreplySetter 仍是查询路径里懒写入的普通 HashMap。这两个 get/put 在高并发下不是线程安全的,轻则重复反射,重则把缓存状态打坏并导致随机查询失败。这里建议改成 ConcurrentHashMap + computeIfAbsent,或者在 start() 里一次性预热这些缓存。

Also applies to: 385-415

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@search/src/main/java/org/zstack/query/QueryFacadeImpl.java` around lines 352
- 361, The lazy-populated plain HashMaps used in doQuery (autoQueryMap and
replySetter) are not thread-safe under concurrent syncSubmit("api-query") calls;
replace their declarations with ConcurrentHashMap and change lazy get/put sites
in doQuery (and the similar block at lines ~385-415) to use computeIfAbsent to
atomically cache reflections, or alternatively prewarm these maps in start() so
they are fully populated before concurrent queries run; update references to
autoQueryMap.get/put and replySetter.get/put accordingly to use computeIfAbsent
(or remove lazy population if prewarming).
storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotManagerImpl.java (1)

1117-1133: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

失败分支不要提前 reply,且回滚前要判空 UUID

这里的失败分支先 bus.reply(msg, ret),再 trigger.fail(...),外层 error handler 会再回一次;而 capability 检查阶段失败时 vo.getUuid() 还没赋值,后面的 rollbackSnapshot(vo.getUuid()) 会把原始错误覆盖掉。

建议只在内层设置错误并 trigger.fail(...),把 reply 统一留给 error handler;同时在 error handler 里先判断 vo.getUuid() != null 再回滚。

💡 建议修改
                         MessageReply reply = bus.call(askMsg);
                         if (!reply.isSuccess()) {
                             ret.setError(operr(ORG_ZSTACK_STORAGE_SNAPSHOT_10016, reply.getError(),
                                     "cannot ask primary storage[uuid:%s] for volume snapshot capability",
                                     vol.getUuid()));
-                            bus.reply(msg, ret);
                             trigger.fail(ret.getError());
                             return;
                         }
@@
                         if (!capability.isSupport()) {
                             ret.setError(operr(ORG_ZSTACK_STORAGE_SNAPSHOT_10017, "primary storage[uuid:%s] doesn't support volume snapshot;" +
                                     " cannot create snapshot for volume[uuid:%s]", primaryStorageUuid, vol.getUuid()));
-                            bus.reply(msg, ret);
                             trigger.fail(ret.getError());
                             return;
                         }
@@
                 error(new FlowErrorHandler(msg) {
                     `@Override`
                     public void handle(ErrorCode errCode, Map data) {
-                        rollbackSnapshot(vo.getUuid());
+                        if (vo.getUuid() != null) {
+                            rollbackSnapshot(vo.getUuid());
+                        }
                         ret.setError(errCode);
                         bus.reply(msg, ret);
                     }
                 });

Also applies to: 1227-1233

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotManagerImpl.java`
around lines 1117 - 1133, The failure branches inside the
AskVolumeSnapshotCapabilityReply handling currently call bus.reply(msg, ret)
before trigger.fail(...), and later rollbackSnapshot(vo.getUuid()) can overwrite
the original error because vol/vo.getUuid() may be null; remove the inner
bus.reply(msg, ret) calls in the capability failure paths (leave only
ret.setError(...) and trigger.fail(...)) so replies are handled by the outer
error handler, and update the outer error handler to check vo.getUuid() != null
before calling rollbackSnapshot(vo.getUuid()); apply the same change to the
similar block around the other capability check (lines 1227-1233).
rest/src/main/resources/scripts/GoInventory.groovy (1)

1356-1368: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

other_actions.go 文件在无方法内容时因未使用的 context 导入而编译失败。

generateOtherActionsFile() 没有条件判断,即使 methodsContent 为空也会生成文件;新增的 context 导入缺少占位符引用(如 paramview 那样的 var _ = ...)。当没有孤立的 action 需要生成时,Go 编译会因未使用的导入报错。

建议添加占位符引用来消除该导入:

修复建议
         content.append("import (\n")
         content.append("\t\"context\"\n")
         if (needsFmt) {
             content.append("\t\"fmt\"\n")
         }
         content.append("\t\"github.com/zstackio/zstack-sdk-go-v2/pkg/param\"\n")
         content.append("\t\"github.com/zstackio/zstack-sdk-go-v2/pkg/view\"\n")
         content.append(")\n\n")
+        content.append("var _ = context.Background // avoid unused import\n")
         content.append("var _ = param.BaseParam{} // avoid unused import\n")
         content.append("var _ = view.MapView{} // avoid unused import\n\n")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest/src/main/resources/scripts/GoInventory.groovy` around lines 1356 - 1368,
The generated Go file always imports "context" even when methodsContent is empty
causing unused-import compile errors; in generateOtherActionsFile() (in
GoInventory.groovy) either skip adding the "context" import when no methods are
emitted or add a placeholder reference like `var _ = context.Background()`
(similar to the existing `var _ = param.BaseParam{}` and `var _ =
view.MapView{}`) into the generated content so the context import is used when
methodsContent is empty.
rest/src/main/resources/scripts/GoApiTemplate.groovy (1)

1165-1189: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

generateGenericMethod 的 PUT 调用漏传 ctx,会生成不可编译的 Go 代码

cli.Put/cli.PutWithRespKey 的调用参数在这些分支里仍是旧签名拼接,未带 ctx,与本次统一改造后的其他分支不一致。

建议修复(示例)
- String putArgs = unwrap ? 
-     """("${cleanPath}", ${paramName}, map[string]interface{}{
+ String putArgs = unwrap ? 
+     """(ctx, "${cleanPath}", ${paramName}, map[string]interface{}{
 \t\t"${actionKey}": map[string]interface{}{},
 \t}, &resp)""" :
-     """("${cleanPath}", ${paramName}, "", map[string]interface{}{
+     """(ctx, "${cleanPath}", ${paramName}, "", map[string]interface{}{
 \t\t"${actionKey}": map[string]interface{}{},
 \t}, &resp)"""
- String putArgs = unwrap ?
-     """("${cleanPath}", uuid, params, &resp)""" :
-     """("${cleanPath}", uuid, "", params, &resp)"""
+ String putArgs = unwrap ?
+     """(ctx, "${cleanPath}", uuid, params, &resp)""" :
+     """(ctx, "${cleanPath}", uuid, "", params, &resp)"""

Also applies to: 1232-1256, 1263-1287

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest/src/main/resources/scripts/GoApiTemplate.groovy` around lines 1165 -
1189, The PUT calls generated by generateGenericMethod are missing the context
parameter: update the putArgs constructions for both branches that use cli.Put
and cli.PutWithRespKey so that ctx is passed as the first argument (e.g. change
arguments from ("${cleanPath}", ${paramName}, ...) or ("${cleanPath}",
${paramName}, "", ...) to (ctx, "${cleanPath}", ${paramName}, ...) or (ctx,
"${cleanPath}", ${paramName}, "", ...)), doing this for both the unwrap and
non-unwrap variants in the blocks that build putArgs (including the other
occurrences you noted around lines 1232-1256 and 1263-1287) so the generated
functions call cli.Put/cli.PutWithRespKey(ctx, ...) and compile correctly.
plugin/sdnController/src/main/java/org/zstack/sdnController/SdnControllerBase.java (1)

192-194: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

用户名变更判断比较了错误字段

这里把 msg.getUserName()self.getName() 比较,应该与 self.getUsername() 比较,否则会导致用户名变更判断不正确。

建议修复
- if (msg.getUserName() != null && !msg.getUserName().equals(self.getName())) {
+ if (msg.getUserName() != null && !msg.getUserName().equals(self.getUsername())) {
     chain.getData().put(SDN_CONTROLLER_USERNAME, self.getName());
     self.setUsername(msg.getUserName());
     changed = true;
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugin/sdnController/src/main/java/org/zstack/sdnController/SdnControllerBase.java`
around lines 192 - 194, The username-change check in SdnControllerBase
incorrectly compares msg.getUserName() to self.getName(); change the comparison
to msg.getUserName() != self.getUsername() so the logic tests against the
current username, and when storing the previous username into chain.getData()
under SDN_CONTROLLER_USERNAME use self.getUsername() (the prior value) before
calling self.setUsername(msg.getUserName()); update the block around
SDN_CONTROLLER_USERNAME, self.getUsername(), self.setUsername(...), and
msg.getUserName() accordingly.
🟠 Major comments (25)
header/src/main/java/org/zstack/header/vm/VmNicInventory.java-43-44 (1)

43-44: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

新增 interfaceId 后缺少 VO->Inventory 映射,导致字段值可能丢失。

VmNicInventory(VmNicVO vo) 当前没有给 interfaceId 赋值,valueOf() 路径构建出的 inventory 可能始终拿不到该字段。

建议修复
 public VmNicInventory(VmNicVO vo) {
     this.setUuid(vo.getUuid());
     this.setVmInstanceUuid(vo.getVmInstanceUuid());
@@
     this.setDriverType(vo.getDriverType());
+    this.setInterfaceId(vo.getInterfaceId());
     this.setType(vo.getType());
     this.setState(vo.getState().toString());
 }

Also applies to: 228-234

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@header/src/main/java/org/zstack/header/vm/VmNicInventory.java` around lines
43 - 44, The new private field interfaceId in VmNicInventory is not populated
from the VO; update the VmNicInventory(VmNicVO vo) constructor (and any other
factory/path like the static valueOf(VmNicVO) method) to read
vo.getInterfaceId() and assign it to this.interfaceId so the inventory retains
the value; ensure both the constructor and valueOf mapping consistently set
interfaceId from VmNicVO.
network/src/main/java/org/zstack/network/l2/L2NetworkExtensionPointEmitter.java-56-68 (1)

56-68: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

beforeUpdate 中吞掉异常会绕过扩展点校验。

Line 63-66 当前仅记录日志并继续执行;在“before”阶段这会让扩展失败后仍可能继续改 VLAN,存在一致性风险。建议把异常继续抛给上层处理。

建议修改
 public void beforeUpdate(final L2NetworkInventory inv) {
     for (L2NetworkUpdateExtensionPoint ext : updateExtensions) {
         try {
             ext.beforeChangeL2NetworkVlanId(inv);
         } catch (RuntimeException e) {
-            // propagate validation failures and other runtime exceptions immediately
             throw e;
         } catch (Exception e) {
-            logger.warn(String.format("unhandled exception in L2NetworkUpdateExtensionPoint.beforeChangeL2NetworkVlanId of %s",
-                    ext.getClass().getCanonicalName()), e);
+            throw new RuntimeException(String.format(
+                    "exception in L2NetworkUpdateExtensionPoint.beforeChangeL2NetworkVlanId of %s",
+                    ext.getClass().getCanonicalName()), e);
         }
     }
 }
As per coding guidelines: “在必要时,应继续抛出异常,使上层业务处理者可以转换为用户友好的错误提示。”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@network/src/main/java/org/zstack/network/l2/L2NetworkExtensionPointEmitter.java`
around lines 56 - 68, The beforeUpdate method in L2NetworkExtensionPointEmitter
swallows non-RuntimeExceptions from each
L2NetworkUpdateExtensionPoint.beforeChangeL2NetworkVlanId call, which bypasses
extension validation; change the error handling in the for-loop over
updateExtensions so that any Exception (not just RuntimeException) is propagated
to the caller (e.g., rethrow the caught Exception or wrap and throw a
RuntimeException) instead of only logging it, ensuring upstream callers of
beforeUpdate receive and can handle the failure.
plugin/xinfini/src/main/java/org/zstack/xinfini/XInfiniStorageController.java-279-281 (1)

279-281: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

请通过 comp.fail(...) 返回失败,而不是直接抛异常。

Line [279] 直接抛 OperationFailureException 会绕过 Completion 回调语义;调用方若依赖回调收敛,可能出现流程提前中断或未按预期处理失败。

💡 建议修改
     `@Override`
     public void blacklist(String installPath, String protocol, HostInventory h, Completion comp) {
-        throw new OperationFailureException(operr("xinfini does not support volume path isolation yet, " +
-                        "abort starting VM on host[uuid:%s, ip:%s] to prevent split-brain on volume[path:%s, protocol:%s]",
-                h.getUuid(), h.getManagementIp(), installPath, protocol));
+        comp.fail(operr("xinfini does not support volume path isolation yet, " +
+                        "abort starting VM on host[uuid:%s, ip:%s] to prevent split-brain on volume[path:%s, protocol:%s]",
+                h.getUuid(), h.getManagementIp(), installPath, protocol));
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugin/xinfini/src/main/java/org/zstack/xinfini/XInfiniStorageController.java`
around lines 279 - 281, 在 XInfiniStorageController 中不要直接抛出
OperationFailureException(目前在那段代码用 throw new
OperationFailureException(...)),改为通过已有的 Completion 对象返回失败:调用
comp.fail(operr("xinfini does not support volume path isolation yet, abort
starting VM on host[uuid:%s, ip:%s] to prevent split-brain on volume[path:%s,
protocol:%s]", h.getUuid(), h.getManagementIp(), installPath, protocol));
确保使用与当前上下文中可用的 Completion 变量名(如 comp 或 completion),并移除原先的 throw 语句以保持回调语义一致。
compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java-39-39 (1)

39-39: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

stopped 仅保证可见性,无法避免 stop/track 的竞态

当前是典型 check-then-act 竞态:trackHost() 可能在读取到 stopped=false 后,和 stop() 并发交错,最终在 stop() 之后仍启动新的 Tracker

🔧 建议修复(统一生命周期锁)
 public class HostTrackImpl implements HostTracker, ManagementNodeChangeListener, Component, ManagementNodeReadyExtensionPoint {
@@
     private Map<String, Tracker> trackers = new ConcurrentHashMap<>();
     private volatile boolean stopped;
+    private final Object lifecycleLock = new Object();
@@
     public void trackHost(String hostUuid) {
-        if (stopped) {
-            logger.debug(String.format("skip tracking host[uuid:%s], host tracker is stopped", hostUuid));
-            return;
-        }
+        synchronized (lifecycleLock) {
+            if (stopped) {
+                logger.debug(String.format("skip tracking host[uuid:%s], host tracker is stopped", hostUuid));
+                return;
+            }

-        if (!destMaker.isManagedByUs(hostUuid)) {
-            logger.debug(String.format("skip tracking host[uuid:%s], not managed by us", hostUuid));
-            return;
-        }
+            if (!destMaker.isManagedByUs(hostUuid)) {
+                logger.debug(String.format("skip tracking host[uuid:%s], not managed by us", hostUuid));
+                return;
+            }

-        Tracker t = trackers.get(hostUuid);
-        if (t != null) {
-            t.cancel();
-        }
+            Tracker t = trackers.get(hostUuid);
+            if (t != null) {
+                t.cancel();
+            }

-        t = new Tracker(hostUuid);
-        trackers.put(hostUuid, t);
+            t = new Tracker(hostUuid);
+            trackers.put(hostUuid, t);

-        if (CoreGlobalProperty.UNIT_TEST_ON && !alwaysStartRightNow) {
-            t.start();
-        } else {
-            t.startRightNow();
-        }
+            if (CoreGlobalProperty.UNIT_TEST_ON && !alwaysStartRightNow) {
+                t.start();
+            } else {
+                t.startRightNow();
+            }

-        logger.debug(String.format("starting tracking hosts[uuid:%s]", hostUuid));
+            logger.debug(String.format("starting tracking hosts[uuid:%s]", hostUuid));
+        }
     }
@@
     public boolean start() {
-        stopped = false;
+        synchronized (lifecycleLock) {
+            stopped = false;
+        }
         populateExtensions();
@@
     public boolean stop() {
-        stopped = true;
-        new HashSet<>(trackers.values()).forEach(Tracker::cancel);
-        trackers.clear();
+        synchronized (lifecycleLock) {
+            stopped = true;
+            new HashSet<>(trackers.values()).forEach(Tracker::cancel);
+            trackers.clear();
+        }
         return true;
     }

Also applies to: 255-258, 294-297, 367-367, 436-439

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java` at line 39,
HostTrackImpl's volatile stopped causes a check-then-act race between
trackHost() and stop() so new Trackers may start after stop(); fix by adding a
dedicated lifecycle lock (e.g., a private final ReentrantLock lifecycleLock) and
use it to guard all accesses to stopped and to the Tracker start/stop sequences:
acquire the lock in trackHost(), check stopped and only create/start the Tracker
while holding the lock; acquire the lock in stop(), set stopped and stop/cleanup
any existing Tracker while holding the lock; remove reliance on volatile-only
semantics for stopped and ensure all methods that inspect or mutate tracker
lifecycle (trackHost(), stop(), any start/stop helper methods) use the same
lifecycleLock to make the check-and-act atomic.
network/src/main/java/org/zstack/network/l2/L2NetworkApiInterceptor.java-118-123 (1)

118-123: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

hasSdnControllerTag 方法应使用标准的 tag 匹配 API 而非简单的字符串前缀匹配。

当前实现通过 tag.startsWith("SdnControllerUuid::") 判断系统标签,这种方式过于简陋且容易出错。实际的 SDN controller tag 格式由 PatternedSystemTag 定义为 SdnControllerUuid::{SdnControllerUuid},应使用标准的 isMatch() 方法:

private boolean hasSdnControllerTag(List<String> systemTags) {
    if (systemTags == null || systemTags.isEmpty()) {
        return false;
    }
    return systemTags.stream().anyMatch(tag -> L2NetworkSystemTags.L2_NETWORK_SDN_CONTROLLER_UUID.isMatch(tag));
}

这样可以确保在 138-140 行的验证逻辑中准确判断是否存在 SDN controller 标签,避免错误地要求 physicalInterface

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@network/src/main/java/org/zstack/network/l2/L2NetworkApiInterceptor.java`
around lines 118 - 123, Replace the naive prefix check in hasSdnControllerTag
with the PatternedSystemTag matcher: locate the hasSdnControllerTag method and
change its stream anyMatch predicate to call
L2NetworkSystemTags.L2_NETWORK_SDN_CONTROLLER_UUID.isMatch(tag) instead of
tag.startsWith(...), keeping the null/empty guard; this ensures the SDN
controller tag is detected using the standard tag-matching API used elsewhere.
conf/db/upgrade/V5.5.22__schema.sql-134-148 (1)

134-148: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

ZnsTenantRouterVO.sdnControllerUuid 少了控制器级别的外键。

tenantUuid 这里是可空的,但表里只有 tenantUuid -> ZnsTenantVO 的 FK。未绑定 tenant 的 router 在删除 controller 后不会被级联清理,会留下孤儿记录;而且 sdnControllerUuid 本身也失去了数据库层的完整性约束。建议补上 sdnControllerUuid -> SdnControllerVO(uuid) 的 FK。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@conf/db/upgrade/V5.5.22__schema.sql` around lines 134 - 148, The
ZnsTenantRouterVO table is missing a foreign key on sdnControllerUuid causing
orphaned rows and loss of DB integrity; add a FK constraint on
ZnsTenantRouterVO.sdnControllerUuid referencing SdnControllerVO.uuid (e.g.
CONSTRAINT fk_zns_tr_controller FOREIGN KEY (`sdnControllerUuid`) REFERENCES
`SdnControllerVO`(`uuid`) ON DELETE CASCADE) so that router rows are
cascaded/validated when controllers are removed and the sdnControllerUuid column
is enforced at the DB level.
compute/src/main/java/org/zstack/compute/vm/VmDetachNicFlow.java-98-116 (1)

98-116: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

不要在 ReturnIpMsg 失败后继续删 NIC。

errorCodeList 在这里被完全忽略了;只要有一个 IP 归还失败,后面仍会执行 dbf.removeByPrimaryKey(...)。这样会把 VmNicVO 删掉,但失败的 UsedIpVO 还留在库里,后续既难重试也会直接泄漏地址。至少应在有归还失败时中断删除并把错误往上抛。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@compute/src/main/java/org/zstack/compute/vm/VmDetachNicFlow.java` around
lines 98 - 116, The current done(ErrorCodeList errorCodeList) in the
WhileDoneCompletion ignores errorCodeList and always calls callReleaseSdnNics
then deletes the VmNicVO, which can remove the NIC while IP release failed;
change the logic in done(ErrorCodeList) to inspect errorCodeList (from
WhileDoneCompletion) and if it contains any errors, do not call
dbf.removeByPrimaryKey(nic.getUuid(), VmNicVO.class) or proceed with
trigger.next — instead propagate an aggregated ErrorCode (or fail the flow) so
the caller can handle retry/compensation; only when errorCodeList is empty/ok
should you call callReleaseSdnNics and then remove the VmNicVO and trigger.next.
Ensure you update references in done, callReleaseSdnNics callback handling, and
any places expecting trigger.next on success so failures are surfaced.
plugin/kvm/src/main/java/org/zstack/kvm/VmNicLifecycleKvmBridge.java-64-107 (1)

64-107: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

不要把不同的 VmNicLifecycleExtensionPoint 并发执行。

这里把并发度设成 extensions.size(),会让多个扩展同时对同一台 host 的 NIC 集合做 reconcileOnHost。扩展点契约里看不到并发安全保证,这种并发很容易引入竞态、重复回收或互相超时。更稳妥的是串行执行(each 或并发度 1)。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugin/kvm/src/main/java/org/zstack/kvm/VmNicLifecycleKvmBridge.java` around
lines 64 - 107, The loop currently runs with concurrency extensions.size(),
allowing multiple VmNicLifecycleExtensionPoint implementations in extensions to
call reconcileOnHost concurrently which can cause race conditions; change the
While<> invocation to run serially (concurrency = 1) or use the serial helper
(each) so that extensions are processed one-by-one; keep the existing timeout
handling (thdf.submitTimeoutTask and ThreadFacadeImpl.TimeoutTaskReceipt),
completed AtomicBoolean and the try/catch around ext.reconcileOnHost, but
replace the extensions.size() concurrency argument with 1 (or switch to the
serial iterator variant) to ensure no two extensions run reconcileOnHost for the
same host simultaneously.
conf/db/upgrade/V5.5.22__schema.sql-87-148 (1)

87-148: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

新增 ZNS 表和外键需要显式带上 zstack schema。

这一段新建表和 FK 引用都改成了未限定 schema 的名字,例如 CREATE TABLE IF NOT EXISTS \ZnsControllerVO`REFERENCES `SdnControllerVO`。升级脚本在这个目录里默认约定是显式使用 zstack....``;否则默认库不是 zstack 时,表可能建到错误 schema,外键解析也会变得不稳定。 Based on learnings: In ZStack, the database schema name is fixed as 'zstack'. For SQL upgrade scripts under conf/db/upgrade, continue using TABLE_SCHEMA = 'zstack' and reference tables as zstack.`UsedIpVO`.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@conf/db/upgrade/V5.5.22__schema.sql` around lines 87 - 148, The new ZNS
tables and their foreign keys are created without the explicit zstack schema;
update all CREATE TABLE and REFERENCES targets to qualify the schema as zstack
(e.g., zstack.`ZnsControllerVO`, zstack.`ZnsTransportZoneVO`,
zstack.`ZnsTenantVO`, zstack.`ZnsTenantRouterVO`) and change FK references to
zstack.`SdnControllerVO`, zstack.`L2NetworkEO`, and zstack.`ZnsTenantVO`
accordingly (or ensure TABLE_SCHEMA = 'zstack' is used consistently in this
migration), so the tables and CONSTRAINTs are created in the zstack schema and
foreign-key resolution remains stable.
network/src/main/java/org/zstack/network/l2/L2NoVlanNetwork.java-972-975 (1)

972-975: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

这里的 ZNS 特判是单向的。

现在只在待挂载的 tl2 是 ZNS 时直接 return。如果集群里已经有 ZNS 的 L2NoVlanNetwork,它仍会留在 l2s 里参与后面的 physicalInterface 冲突判断,从而错误拦截普通 L2NoVlanNetwork 的挂载。这里应像下面的 VLAN 分支一样在查询阶段排除 vSwitchType = ZNS,或在循环里跳过这些记录。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@network/src/main/java/org/zstack/network/l2/L2NoVlanNetwork.java` around
lines 972 - 975, The current ZNS special-case is one-way: you return early when
the to-be-attached tl2 has vSwitchType==L2NetworkConstant.VSWITCH_TYPE_ZNS, but
existing L2NoVlanNetwork records in l2s with vSwitchType ZNS are still
considered in the physicalInterface conflict checks and can block non-ZNS
attachments. Fix by excluding ZNS networks symmetrically—either filter out
entries with vSwitchType==L2NetworkConstant.VSWITCH_TYPE_ZNS when
building/querying l2s (same approach as the VLAN branch) or, inside the loop
that iterates l2s during the physicalInterface conflict check, explicitly skip
records whose getvSwitchType() equals L2NetworkConstant.VSWITCH_TYPE_ZNS so ZNS
networks don’t participate in conflict detection for non-ZNS attachments (refer
to tl2, l2s, getvSwitchType(), physicalInterface check).
conf/db/upgrade/V5.5.16__schema.sql-357-358 (1)

357-358: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

这里不要再引入 0000-00-00 00:00:00 作为 TIMESTAMP 默认值。

VmModelMountVO.createDate 现在用的是零时间。这个值在严格 SQL mode / NO_ZERO_DATE 下会直接导致升级失败,而且本仓库最近的迁移已经统一改成非零哨兵值来兼容双 TIMESTAMP 场景。建议改成 '2000-01-01 00:00:00'。 Based on learnings: In ZStack MySQL upgrade scripts under conf/db/upgrade/*.sql, if a table already contains a TIMESTAMP column defined with ON UPDATE CURRENT_TIMESTAMP, then any second TIMESTAMP column must use a non-CURRENT_TIMESTAMP sentinel default such as '2000-01-01 00:00:00'.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@conf/db/upgrade/V5.5.16__schema.sql` around lines 357 - 358, The TIMESTAMP
default for VmModelMountVO.createDate currently uses '0000-00-00 00:00:00',
which breaks strict SQL modes and multi-TIMESTAMP tables; update the createDate
column definition (the TIMESTAMP that is not the ON UPDATE CURRENT_TIMESTAMP
one) to use a non-zero sentinel default such as '2000-01-01 00:00:00' instead of
'0000-00-00 00:00:00' so it is compatible with the existing lastOpDate
(CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP) column and strict MySQL modes.
plugin/sdnController/src/main/java/org/zstack/sdnController/SdnControllerFactory.java-15-25 (1)

15-25: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

resolveStatusevent == null 没有保护,会触发 NPE。

switch (event)event 为空时会直接异常,建议先做空值兜底并返回 vo.getStatus()

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugin/sdnController/src/main/java/org/zstack/sdnController/SdnControllerFactory.java`
around lines 15 - 25, The resolveStatus method in SdnControllerFactory should
guard against a null event to avoid an NPE: at the start of
resolveStatus(SdnControllerStatusEvent event, SdnControllerVO vo) check if event
is null and immediately return vo.getStatus(); then proceed with the existing
switch on event. This ensures resolveStatus always returns the current status
when event is missing without changing the existing case handling.
plugin/loadBalancer/src/main/java/org/zstack/network/service/lb/LoadBalancerApiInterceptor.java-1494-1497 (1)

1494-1497: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

这里存在空指针风险:healthTarget 可能为 null

healthTarget.split(":") 在缺失历史 tag 或脏数据场景下会直接抛 NPE,建议先判空并给出一致的参数错误。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugin/loadBalancer/src/main/java/org/zstack/network/service/lb/LoadBalancerApiInterceptor.java`
around lines 1494 - 1497, The code in LoadBalancerApiInterceptor uses
healthTarget.split(":") without guarding against healthTarget being null, which
can cause an NPE; before calling split (inside the block where
msg.getHealthCheckProtocol() equals
LoadBalancerConstants.HEALTH_CHECK_TARGET_PROTOCL_HTTP), validate that
healthTarget is not null or empty and if it is throw the same
OperationFailureException using argerr(ORG_ZSTACK_NETWORK_SERVICE_LB_10119, ...)
to return a consistent parameter error; update the check around healthTarget
(the variable used to build ts) to perform a null/empty check first, then split
and validate ts.length == 2 as currently implemented.
compute/src/main/java/org/zstack/compute/vm/VmAllocateNicFlow.java-124-157 (1)

124-157: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

失败分支不能绕过 VmInstanceNicFactory 直接删库。

nicVO 是通过 vnicFactory.createVmNic() 创建的,但扩展失败时这里只删除了 VmNicVO 主表。若具体工厂还创建了类型相关记录或标签,这里会留下脏数据;而 Line 326-328 的 rollback 已经表明正确的释放路径是 vnicFactory.releaseVmNic(...)。建议这里也走工厂释放,而不是手工删主表。

建议修改
                 `@Override`
                 public void fail(ErrorCode errorCode) {
                     try {
-                        dbf.removeByPrimaryKey(nicVO.getUuid(), VmNicVO.class);
+                        vnicFactory.releaseVmNic(nic);
                     } catch (Throwable t) {
                         logger.warn(String.format("failed to remove VmNicVO[uuid:%s] after before allocate extension failure",
                                 nicVO.getUuid()), t);
                     }
                     errs.add(errorCode);

As per coding guidelines "策略与机制分离:接口定义一个消息处理机制,而具体业务由不同实现类完成,从而减少耦合性,提高代码可读性。"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@compute/src/main/java/org/zstack/compute/vm/VmAllocateNicFlow.java` around
lines 124 - 157, The failure branch currently removes only the VmNicVO row
created by vnicFactory.createVmNic(nic, spec) using
dbf.removeByPrimaryKey(nicVO.getUuid(), VmNicVO.class), which can leave
factory-specific records/tags orphaned; replace that manual delete with the
factory’s release path by calling vnicFactory.releaseVmNic(nicVO.getUuid(),
spec) (or the appropriate releaseVmNic method signature used at lines 326-328)
inside the fail() block, and remove the dbf.removeByPrimaryKey call and its
try/catch so the factory cleans up all related resources consistently (keep
errs.add(errorCode) and wcomp.allDone()).
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java-5455-5460 (1)

5455-5460: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

创建新数据卷后未采用 agent 返回路径,可能回传错误 installPath

CreateEmptyVolumeRspinstallPath 在这里被忽略了;如果后端返回了规范化路径,当前逻辑会继续使用请求路径,导致后续返回/清理路径不一致。

🔧 建议修复
@@
                         httpCall(CREATE_VOLUME_PATH, cmd, CreateEmptyVolumeRsp.class, new ReturnValueCompletion<CreateEmptyVolumeRsp>(trigger) {
                             `@Override`
                             public void success(CreateEmptyVolumeRsp ret) {
                                 newVolumeCreated = true;
+                                newVolumePath = buildEmptyVolumeInstallPath(targetPoolName, cmd.installPath, ret.getInstallPath());
                                 trigger.next();
                             }

Also applies to: 5505-5506

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java`
around lines 5455 - 5460, The success callback for the httpCall to
CREATE_VOLUME_PATH (ReturnValueCompletion<CreateEmptyVolumeRsp> in
CephPrimaryStorageBase) ignores the agent-returned
CreateEmptyVolumeRsp.installPath and continues using the original request path,
which can cause mismatched install/cleanup paths; update the success handler to
check ret.getInstallPath() (or ret.installPath) and, if non-empty, replace the
local installPath/volumePath variable with this normalized value before setting
newVolumeCreated and calling trigger.next(); apply the same change to the
analogous handler around the other occurrence referenced (lines ~5505-5506) so
all code paths use the agent-provided install path for subsequent responses and
cleanup.
storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotManagerImpl.java-1037-1043 (1)

1037-1043: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

用同步后的 sp 状态发 canonical event

syncVolumeSnapshotDBAfterTakeSnapshot() 已经把库里的快照状态统一成最终值并返回了 sp,这里仍然读取 snapshot.getStatus() 会让事件状态和最终 inventory 脱节;如果后端 reply 没带状态或还是旧值,valueOf(...) 这里也会直接炸。

💡 建议修改
                         new FireSnapShotCanonicalEvent().
                                 fireSnapShotStatusChangedEvent(
-                                        VolumeSnapshotStatus.valueOf(snapshot.getStatus()),
+                                        VolumeSnapshotStatus.valueOf(sp.getStatus()),
                                         sp);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotManagerImpl.java`
around lines 1037 - 1043, The canonical event is using snapshot.getStatus()
which may be stale or null; use the synchronized inventory returned by
syncVolumeSnapshotDBAfterTakeSnapshot (sp) when firing the event instead.
Replace VolumeSnapshotStatus.valueOf(snapshot.getStatus()) with a safe
conversion from sp.getStatus() (e.g.
VolumeSnapshotStatus.valueOf(sp.getStatus()) or a null/invalid-safe mapper) when
calling new FireSnapShotCanonicalEvent().fireSnapShotStatusChangedEvent(..., sp)
so the event status always matches the persisted inventory and avoid
NPE/IllegalArgumentException for missing/invalid backend status.
core/src/main/java/org/zstack/core/rest/webhook/WebhookCallbackClient.java-101-109 (1)

101-109: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

先登记 pendingCalls,再启动超时任务。

Lines 104-109 现在是先注册 timeout、后 put。当 timeout 为 0 或非常小时,超时回调可能先执行到 fail(taskId, ...),因为 map 里还没有 entry,这次失败会被直接吞掉;随后插入的 entry 就会永久残留,并且对应 completion 永远不会结束。这是一个典型的 TOCTOU 并发 bug。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/src/main/java/org/zstack/core/rest/webhook/WebhookCallbackClient.java`
around lines 101 - 109, The submit method has a TOCTOU race: it submits the
timeout task (thdf.submitTimeoutTask) before inserting the PendingEntry into
pendingCalls so a very fast timeout can call fail(taskId, ...) while the map has
no entry, leaving a leaked pendingCalls entry; fix by creating the PendingEntry
(new PendingEntry<>(completion, /*receipt placeholder*/)) and putting it into
pendingCalls first, then call thdf.submitTimeoutTask to obtain the
TimeoutTaskReceipt and set that receipt into the existing PendingEntry (or
replace the map entry), ensuring the timeout handler always sees the entry;
update references in submit, PendingEntry, pendingCalls, thdf.submitTimeoutTask
and fail accordingly.
storage/src/main/java/org/zstack/storage/volume/VolumeBase.java-315-317 (1)

315-317: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

在切换 installPath 入库之后,不要再把后续同步失败升级成整个 reinit 失败。

Lines 315-317 和 Lines 515-517 已经把新安装路径持久化到数据库了,但 Lines 340-343 和 Lines 540-543 失败时仍然走整体 error 分支。这样调用方看到的是“重置失败”,实际卷却已经切到了新路径且不会回滚,后续重试会在已生效的新状态上再次执行。这里更安全的做法是把后续大小同步降级为 best-effort,或者把所有必须成功的步骤都放在 installPath 切换之前。

Also applies to: 340-343, 515-517, 540-543

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@storage/src/main/java/org/zstack/storage/volume/VolumeBase.java` around lines
315 - 317, The code in VolumeBase that sets the new install path
(vo.setInstallPath(...) followed by vo = dbf.updateAndRefresh(vo) and
dbInstallPathSwitched = true) persists the new path but still treats subsequent
size-sync failures as a full reinit failure; change the error handling after the
installPath has been persisted so subsequent syncs are best-effort: detect
dbInstallPathSwitched == true and on failures from the size synchronization
steps (the branches that currently go to the overall error path) log a warning
and continue/return success rather than escalating to reinit failure, or
alternatively move all mandatory steps before calling
vo.setInstallPath/dbf.updateAndRefresh so persistence only happens after all
must-succeed actions; update the branches that currently transition to the error
path (the ones after vo.setInstallPath/dbf.updateAndRefresh and the
corresponding later occurrences) to implement the best-effort downgrade.
storage/src/main/java/org/zstack/storage/volume/VolumeBase.java-554-558 (1)

554-558: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

不要对数据卷 reinit 复用 AfterReimageVmInstanceExtensionPoint

Lines 554-558 在数据卷重置成功后触发的是“VM reimage”扩展点。这个钩子的实现很容易默认自己收到的是根卷/VM 级事件;一旦扩展抛错,这个 API 会在数据卷已经重置成功后反向报错。这里应移除该回调,或者引入专门的数据卷 reinit 扩展点。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@storage/src/main/java/org/zstack/storage/volume/VolumeBase.java` around lines
554 - 558, The code currently invokes AfterReimageVmInstanceExtensionPoint for
data-volume reinit (the loop calling
ext.afterReimageVmInstance(dataVolumeInventory)), which is incorrect; remove
that invocation and either drop the callback entirely for data-volume reinit or
introduce and call a dedicated extension point (e.g.,
DataVolumeReinitExtensionPoint with a method like
afterDataVolumeReinit(DataVolumeInventory)) instead; locate the loop over
pluginRgty.getExtensionList(AfterReimageVmInstanceExtensionPoint.class) and
delete the ext.afterReimageVmInstance(dataVolumeInventory) calls (or replace the
extension lookup/call with the new DataVolumeReinitExtensionPoint and its
method).
core/src/main/java/org/zstack/core/rest/webhook/WebhookCallbackClient.java-81-87 (1)

81-87: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

start() 不要覆盖预先设置的 callbackUrl

Lines 139-140 已经提供了显式覆写入口,但 Lines 81-87 每次启动都会把 callbackUrl 重置成 restf.getSendCommandUrl()。如果调用方在 start() 前按专用回调端点场景设置了自定义 URL,发给外部系统的回调地址会被静默改错。

💡 建议修改
 public void start() {
-    this.callbackUrl = restf.getSendCommandUrl();
+    if (this.callbackUrl == null) {
+        this.callbackUrl = restf.getSendCommandUrl();
+    }
     restf.registerSyncHttpCallHandler(
             protocol.getCallbackPath(),
             protocol.getCallbackClass(),
             this::onCallback);
 }

Also applies to: 139-140

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@core/src/main/java/org/zstack/core/rest/webhook/WebhookCallbackClient.java`
around lines 81 - 87, start() currently unconditionally overwrites
this.callbackUrl with restf.getSendCommandUrl(); instead, preserve any pre-set
custom callbackUrl and only default it when not set—e.g., change the assignment
in start() to set this.callbackUrl = restf.getSendCommandUrl() only if
this.callbackUrl is null/empty (leave custom values intact); keep the
restf.registerSyncHttpCallHandler(protocol.getCallbackPath(),
protocol.getCallbackClass(), this::onCallback) call unchanged so explicit
overrides (the existing setter/override flow) still function.
plugin/sdnController/src/main/java/org/zstack/sdnController/SdnControllerManagerImpl.java-273-279 (1)

273-279: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

不要把 tag 持久化失败降级成成功响应。

这里把 createTagsFromAPICreateMessage() 和 inventory 回填包在同一个 try/catch 里后,tag 写入失败也只会打 warn,最终仍然发布 success event。这样 API 调用方会收到“创建成功”,但请求里的 systemTags 可能根本没落库,行为和请求已经不一致了。建议只把 inventory 组装当成 best-effort;tag 创建失败至少要把 event 标成失败,必要时再补清理逻辑。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugin/sdnController/src/main/java/org/zstack/sdnController/SdnControllerManagerImpl.java`
around lines 273 - 279, Separate tag persistence from inventory assembly: do not
swallow failures from tagMgr.createTagsFromAPICreateMessage into a success
event. Call tagMgr.createTagsFromAPICreateMessage(vo.getUuid(), ...) in its own
try/catch; if it throws, mark the API event (the variable event) as failed
(populate the event error / setSuccess(false) and publish/return the failure)
instead of continuing to publish a success. Treat event.setInventory(...) /
reading the DB as best-effort in a separate try/catch so inventory backfill
doesn't override tag creation errors; only proceed to publish a success event
when tag creation succeeded.
compute/src/main/java/org/zstack/compute/vm/VmNicLifecycleManager.java-342-353 (1)

342-353: ⚠️ Potential issue | 🟠 Major

在 release 路径中,doCleanup() 应该使用当前宿主而不是目标宿主。

doCleanup() 目前通过 spec.getDestHost() 获取清理宿主,但在 stop/destroy 等释放资源的操作中,目标宿主(destHost)为 null。这导致所有 NIC 清理逻辑被静默跳过。

相比之下,attach/detach 路径使用 spec.getVmInventory().getHostUuid() 直接获取当前宿主。对于释放资源操作,应该清理 VM 当前所在的宿主,而非一个不存在的目标宿主。

建议改为:

String hostUuid = spec.getVmInventory().getHostUuid();
if (hostUuid == null) {
    completion.done();
    return;
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@compute/src/main/java/org/zstack/compute/vm/VmNicLifecycleManager.java`
around lines 342 - 353, The doCleanup method currently uses spec.getDestHost()
to determine hostUuid which is null for release paths; change it to use the VM's
current host via spec.getVmInventory().getHostUuid() and null-check that value
before proceeding (i.e., replace the spec.getDestHost() check and hostUuid
assignment with retrieval of hostUuid from spec.getVmInventory(), call
completion.done() if it's null, then call runCleanup with the existing
parameters); update references in VmNicLifecycleManager.doCleanup accordingly.
plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java-3062-3063 (1)

3062-3063: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

把快照 DB 同步扩展的异常转成受控失败。

Line 3062 在 agent 已经成功创建快照后,直接执行 Lines 3115-3117 的 DB 同步扩展。这里如果任一扩展抛出运行时异常,异常会从异步 REST 回调直接冒出去,当前 flow 拿不到 trigger.fail(...),很容易留下 hypervisor 已成功、DB 未同步的悬挂快照。这里至少需要 catch 扩展异常并显式 fail 当前 flow,避免继续走成功回包。

Also applies to: 3115-3117

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java` around lines 3062 -
3063, After the hypervisor reports snapshot creation success, the extension
callbacks (calls to syncSnapshotMetadataAfterHypervisorSuccess(...) and
extEmitter.afterTakeSnapshot((KVMHostInventory) getSelfInventory(), msg, cmd,
ret)) can throw runtime exceptions that currently escape the async callback and
leave a hypervisor-created snapshot with DB unsynced; wrap each extension
invocation in a try/catch that catches Throwable, convert the error into a
controlled flow failure by calling trigger.fail(...) with a clear error message
and the caught exception, and return/stop further success processing so the flow
does not send a success reply; apply the same try/catch+trigger.fail pattern to
both the syncSnapshotMetadataAfterHypervisorSuccess call and the
extEmitter.afterTakeSnapshot call (including the place at the later invocation)
so all extension exceptions are handled deterministically.
plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java-5249-5258 (1)

5249-5258: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

不要直接用 agent 的绝对时间去裁剪 MN 侧 pending HTTP 调用。

Line 5250 的 agentStartTimeMillis 来自宿主机,而这里清理的是管理节点本地记录的 pending call。两边时钟只要有偏差,就会误杀重连后刚发出的新请求,或者完全漏掉真正的陈旧请求。更稳妥的做法是用 MN 本地记录的 connect/reconnect 起始时间做边界,或者改成基于 boot-id / 相对启动时长判断。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java` around lines 5249 -
5258, 不要用远端 agent 的绝对时间 rsp.getAgentStartTimeMillis() 去裁剪管理节点的 pending
调用;改成用管理节点本地记录的该连接的 connect/reconnect 起始时间或基于 boot-id/相对启动时长来算边界,然后将该本地时间传给
restf.failPendingCallsForResourceBefore(替换掉当前对 rsp.getAgentStartTimeMillis()
的依赖)并在日志里打印该本地时间或 boot-id/相对时长以便排查;具体修改点:在处理 rsp.isFirstConnect() 的分支中,不再用
rsp.getAgentStartTimeMillis() 计算 cutoffMillis,而改为读取管理节点保存的连接开始时间(或
boot-id/连接序号)并用该值调用 restf.failPendingCallsForResourceBefore(self.getUuid(),
cutoffMillis, ...),同时调整 logger.info 的参数以反映所使用的本地边界值或 boot-id。
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java-922-926 (1)

922-926: ⚠️ Potential issue | 🟠 Major

添加 purpose 参数到 APIGetVmStartingCandidateClustersHostsMsg 并在 handler 中转发。

APIGetVmStartingCandidateClustersHostsMsg 目前缺少 purpose 字段。在 handle() 方法(第 840 行)只转发了 uuid,导致 API 调用方无法指定列表候选主机的 purpose 为 LIST_CANDIDATES。这样经由 API 发起的候选主机查询在下游到达 line 922-926 时,GetVmStartingCandidateClustersHostsMsg 的 purpose 始终是默认的 ALLOCATE,无法告诉下游过滤器这是一个候选列表操作而非真实分配。

需要:

  1. APIGetVmStartingCandidateClustersHostsMsg 添加 purpose 字段(@APIParam)
  2. handle() 方法中将 API 消息的 purpose 转发到内部 GetVmStartingCandidateClustersHostsMsg
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java` around lines
922 - 926, Add a new purpose field to the API message and forward it into the
internal message: update APIGetVmStartingCandidateClustersHostsMsg to include a
purpose property annotated with `@APIParam` (matching the enum type used by
GetVmStartingCandidateClustersHostsMsg), and in the handler method that builds
and sends the GetVmStartingCandidateClustersHostsMsg (the handle() that
currently only forwards uuid) set amsg.setPurpose(apiMsg.getPurpose()) so the
internal GetVmStartingCandidateClustersHostsMsg receives the purpose (allowing
LIST_CANDIDATES to be propagated).
🟡 Minor comments (3)
plugin/loadBalancer/src/main/java/org/zstack/network/service/lb/LoadBalancerApiInterceptor.java-790-804 (1)

790-804: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

健康检查协议/目标解析前建议统一 trim() 入参。

当前直接解析和比较字符串,用户从 UI 粘贴带空格/换行时会导致误判(冲突、协议不支持、target 非法)。建议在读取 msg 与 systemTag token 后先做 trim 标准化。

As per coding guidelines: **/*Interceptor.java: 注意检查来自 Message 的参数是否做过 trim,用户可能在浏览器上复制粘贴的数据带有空格、换行符等

Also applies to: 1467-1478

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugin/loadBalancer/src/main/java/org/zstack/network/service/lb/LoadBalancerApiInterceptor.java`
around lines 790 - 804, The health-check target/protocol parsing uses raw tokens
from getSystemTagTokens and may mis-handle user input with leading/trailing
whitespace; trim the token strings immediately after retrieving them (e.g.,
normalize entries in healthCheckTargets and the derived healthCheckTarget) so
subsequent calls like isNoneHealthCheckTargetWithSpecificPort and
getHealthCheckProtocolFromTarget operate on trimmed values; apply the same
trimming normalization wherever similar parsing occurs (see the other affected
block referenced around lines 1467-1478) and ensure comparisons against
LoadBalancerSystemTags use trimmed strings.
test/src/test/java/org/zstack/test/kvm/KVMHostUtilsTest.java-141-143 (1)

141-143: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

断言方式会产生子串误判。

这里用 deployIps.contains(ip) 校验可能把 10.0.0.7 误判为命中 10.0.0.70。建议按逗号拆分后做精确匹配。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/src/test/java/org/zstack/test/kvm/KVMHostUtilsTest.java` around lines
141 - 143, The test uses substring matching which can false-positive (e.g.,
"10.0.0.7" matching "10.0.0.70"); in the loop over detected.split(",") replace
the loose check using deployIps.contains(ip) with an exact-match check by
parsing deployIps into discrete entries (e.g., split deployIps by "," into a Set
or list) and assert that the resulting collection contains the exact ip string;
update the assertion in KVMHostUtilsTest where variables detected and deployIps
are used to perform precise membership checks rather than substring contains.
docs/design/zns-pivot/ZCF-2133-zns-pivot-detailed-design.md-1329-1329 (1)

1329-1329: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

补全这些代码块的语言标记,并检查 Line 1329 是否是误留的围栏。

这些 fence 现在会持续触发 markdownlint;其中 Line 1329 还是一个孤立的 ,很容易把后面的章节错误地包进代码块。时序图/ASCII 图统一改成 ` text `,如果 Line 1329 不是有意开启代码块,建议直接删掉。

Also applies to: 1952-1952, 1975-1975, 1996-1996

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/zns-pivot/ZCF-2133-zns-pivot-detailed-design.md` at line 1329,
There are unmarked/incorrect Markdown fences causing markdownlint
failures—locate the isolated triple-backtick at the vicinity of the orphan fence
(the lone "```" around Line 1329) and either remove it if it was left
accidentally or close it properly; for all ASCII/timing/sequence diagrams and
other non-code blocks (including the fences near the other reported locations
around Lines 1952, 1975, 1996) change their opening fences from ``` to ```text
(or add an appropriate language tag) so the blocks are explicitly marked and no
subsequent sections are accidentally captured as code.

Comment on lines +556 to 557
vo.setType(msg.getType() != null ? msg.getType() : L3NetworkConstant.L3_BASIC_NETWORK_TYPE);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

默认类型回退后,工厂选择仍使用 msg.getType() 会触发空指针

Line 556 已允许 msg.getType() 为空并回退默认值,但后续创建流程仍在 Line 590 使用 L3NetworkType.valueOf(msg.getType())。当请求未传 type 时,这里会直接失败,导致创建 L3 网络失败。

建议修复
-                L3NetworkFactory factory = getL3NetworkFactory(L3NetworkType.valueOf(msg.getType()));
+                L3NetworkFactory factory = getL3NetworkFactory(L3NetworkType.valueOf(vo.getType()));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@network/src/main/java/org/zstack/network/l3/L3NetworkManagerImpl.java` around
lines 556 - 557, 在 L3NetworkManagerImpl 创建流程中虽然在 vo.setType(...) 已将
msg.getType() 为空时回退为 L3NetworkConstant.L3_BASIC_NETWORK_TYPE,但后续使用
L3NetworkType.valueOf(msg.getType()) 会在 msg.getType() 为 null 时抛
NPE;请修改创建类型的来源为已经解析后的值(例如使用 vo.getType() 或在本地变量中保存 resolvedType = msg.getType()
!= null ? msg.getType() : L3NetworkConstant.L3_BASIC_NETWORK_TYPE),并将所有对
L3NetworkType.valueOf(...) 的调用改为传入该 resolvedType,以确保在 msg.getType() 为 null
时也能正确创建网络(参考符号:vo.setType(...)、L3NetworkType.valueOf(msg.getType())、L3NetworkManagerImpl)。

Comment on lines +3105 to +3107
VolumeSnapshotInventory snapshot = new VolumeSnapshotInventory();
snapshot.setUuid(msg.getSnapshotName());
snapshot.setVolumeUuid(volume.getUuid());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== TakeSnapshotOnHypervisorMsg =="
fd 'TakeSnapshotOnHypervisorMsg\.java$' -x sh -c 'echo "----- $1 -----"; sed -n "1,220p" "$1"' sh {}

echo
echo "== VolumeSnapshotDBSyncExtensionPoint =="
fd 'VolumeSnapshotDBSyncExtensionPoint\.java$' -x sh -c 'echo "----- $1 -----"; sed -n "1,220p" "$1"' sh {}

echo
echo "== Related usages =="
rg -n -C2 'getSnapshotName\(|getSnapshotUuid\(|syncVolumeSnapshotDBAfterTakeSnapshot|setUuid\(' --type=java

Repository: MatheMatrix/zstack

Length of output: 50376


🏁 Script executed:

# 查看 KVMHost.java 3105-3107 行的完整上下文
sed -n '3095,3120p' plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java

# 查找快照 UUID 的生成或赋值
rg -n 'VolumeSnapshotInventory.*uuid|snapshot.*setUuid|snapshotUuid.*=' plugin/kvm -A2 -B2 | head -50

# 查找对 syncVolumeSnapshotDBAfterTakeSnapshot 的实现
fd '.*\.java$' -x grep -l 'syncVolumeSnapshotDBAfterTakeSnapshot' | head -10

Repository: MatheMatrix/zstack

Length of output: 2034


🏁 Script executed:

# 检查 VolumeSnapshotManagerImpl 中的相关实现
sed -n '1,100p' ./storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotManagerImpl.java | head -50

# 搜索快照 UUID 生成的模式
rg -n 'UUID\.randomUUID|new UUID|setUuid.*UUID' plugin/kvm -A2 -B2 | grep -i snapshot | head -30

# 查找 VolumeSnapshotManagerImpl 中 syncVolumeSnapshotDBAfterTakeSnapshot 的实现
rg -n 'syncVolumeSnapshotDBAfterTakeSnapshot' ./storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotManagerImpl.java -A20

# 检查其他快照代码中 UUID 的设置方式
rg -n 'new VolumeSnapshotInventory\(\)|snapshot\.setUuid' plugin/kvm -B3 -A3 | head -60

Repository: MatheMatrix/zstack

Length of output: 6917


🏁 Script executed:

# 检查是否在调用 takeSnapshot 前创建了 VolumeSnapshotVO
rg -n 'syncSnapshotMetadataAfterHypervisorSuccess' plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java -B50 | grep -E 'new VolumeSnapshotVO|create.*snapshot|VolumeSnapshot.*save|persist' | head -20

# 查找 takeSnapshot 的完整调用链
rg -n 'syncSnapshotMetadataAfterHypervisorSuccess|TakeSnapshotOnHypervisorMsg' plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java -B5 | head -80

# 查看 VolumeSnapshotConstant 确认快照类型定义
rg -n 'HYPERVISOR_SNAPSHOT_TYPE|class VolumeSnapshotConstant' header/src/main/java/org/zstack/header/storage/snapshot/ -A5 | head -40

Repository: MatheMatrix/zstack

Length of output: 2846


🏁 Script executed:

# 查找创建 TakeSnapshotOnHypervisorMsg 的地方
rg -n 'new TakeSnapshotOnHypervisorMsg\|TakeSnapshotOnHypervisorMsg.*=' --type=java -B5 -A8 | head -100

# 查找 VolumeSnapshotVO 的创建位置(应该在 takeSnapshot 前)
rg -n 'new VolumeSnapshotVO\|dbf.persistAndRefresh.*VolumeSnapshotVO' storage/src/main/java/org/zstack/storage/snapshot/ -B3 -A3 | head -80

# 查找 takeSnapshot 方法的调用方
rg -n 'bus.send.*TakeSnapshotOnHypervisorMsg' storage/src/main/java/org/zstack/storage/snapshot/ -B10 -A5 | head -100

Repository: MatheMatrix/zstack

Length of output: 44


🏁 Script executed:

# 查找在 VolumeSnapshotManagerImpl 中如何调用 TakeSnapshotOnHypervisorMsg
rg -n 'TakeSnapshotOnHypervisorMsg' storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotManagerImpl.java -B10 -A10

# 查找快照创建流程中 VolumeSnapshotVO 的插入
rg -n 'createSnapshot|takeSnapshot' storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotManagerImpl.java -A20 | head -120

# 查看 snapshot.getUuid() 在 syncVolumeSnapshotDBAfterTakeSnapshot 之前是如何被赋值的
sed -n '700,750p' storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotManagerImpl.java

Repository: MatheMatrix/zstack

Length of output: 2342


🏁 Script executed:

# 查找调用 TakeSnapshotOnHypervisorMsg 的完整流程
rg -n 'new TakeSnapshotOnHypervisorMsg\|bus.send.*TakeSnapshot' storage/src/main/java/org/zstack/storage/snapshot/ -B15 -A5 | head -150

# 查看 CreateVolumeSnapshotMsg 的处理流程
rg -n 'handle.*CreateVolumeSnapshotMsg' storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotManagerImpl.java -A50 | head -100

# 查找快照记录在何处创建
rg -n 'VolumeSnapshotVO vo = ' storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotManagerImpl.java -B5 -A10 | head -80

Repository: MatheMatrix/zstack

Length of output: 8686


🏁 Script executed:

# 查找创建快照时的完整链条,特别是如何关联 snapshotName 和 UUID
sed -n '810,920p' storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotManagerImpl.java

# 查找 TakeSnapshotOnHypervisorMsg 的构造方式
rg -n 'setSnapshotName|TakeSnapshotOnHypervisorMsg' storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotManagerImpl.java -B3 -A3 | head -100

Repository: MatheMatrix/zstack

Length of output: 5596


🏁 Script executed:

# 继续查看创建快照流程中 TakeSnapshotOnHypervisorMsg 的创建部分
sed -n '900,1000p' storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotManagerImpl.java

# 查找 snapshotName 在 TakeSnapshotOnHypervisorMsg 中如何被赋值
rg -n 'msg.setSnapshotName|getSnapshotName' storage/src/main/java/org/zstack/storage/snapshot/ -B5 -A2

Repository: MatheMatrix/zstack

Length of output: 5324


🏁 Script executed:

# 查看整个创建快照流程中 VolumeSnapshotVO 是如何与 snapshotName 关联的
sed -n '770,810p' storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotManagerImpl.java

# 查看 TakeSnapshotMsg 处理中是否保留了原始快照 UUID
rg -n 'class TakeSnapshotMsg|setSnapshot' header/src/main/java/org/zstack/header/storage/primary/ -B2 -A10 | head -80

Repository: MatheMatrix/zstack

Length of output: 11036


🏁 Script executed:

# 查看 VolumeSnapshotStruct 中是否包含快照 UUID
rg -n 'class VolumeSnapshotStruct|getSnapshotUuid|getCurrentSnapshot' header/src/main/java/org/zstack/header/storage/snapshot/ -A5 | head -80

# 查找在 KVMHost 中如何获取真实的快照 UUID
rg -n 'TakeSnapshotOnHypervisorMsg' plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java -B20 | grep -E 'snapshot|Snapshot' | head -30

# 检查是否存在其他地方设置了真实的 snapshotUuid
rg -n 'snapshotUuid' plugin/kvm/src/main/java/org/zstack/kvm/ -B2 -A2 | head -40

Repository: MatheMatrix/zstack

Length of output: 9224


🏁 Script executed:

# 查看 TakeSnapshotOnHypervisorMsg 中是否有 VolumeSnapshotStruct 信息
cat header/src/main/java/org/zstack/header/host/TakeSnapshotOnHypervisorMsg.java

# 查看在 VolumeSnapshotManagerImpl 中如何将 snapshot 信息传递给 TakeSnapshotOnHypervisorMsg
rg -n 'struct.getCurrent|setSnapshotName' storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotManagerImpl.java -B2 -A2

Repository: MatheMatrix/zstack

Length of output: 1955


🏁 Script executed:

# 查找在什么地方调用 setSnapshotName 以及使用什么值
rg -n 'setSnapshotName' plugin/kvm storage/ header/ --type=java -B3 -A3 | head -80

# 查看 TakeSnapshotOnHypervisorMsg 是在哪里创建和配置的
rg -n 'new TakeSnapshotOnHypervisorMsg\|TakeSnapshotOnHypervisorMsg msg = ' plugin/kvm -B5 -A15 | head -100

Repository: MatheMatrix/zstack

Length of output: 762


应将 snapshot.uuid 设置为真实的快照 UUID,而不是 msg.getSnapshotName()

第 3106 行将 msg.getSnapshotName() 写入了 snapshot.uuid。由于 TakeSnapshotOnHypervisorMsg 仅包含超程序返回的快照名称(非 UUID),而 syncVolumeSnapshotDBAfterTakeSnapshot() 会使用该 uuid 字段作为主键进行数据库查询,快照名称无法匹配数据库中预先创建的快照记录。这导致数据库同步失败,返回 null,造成快照元数据同步完全失效,数据库记录与超程序状态不一致。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java` around lines 3105 -
3107, The code is incorrectly assigning msg.getSnapshotName() to
VolumeSnapshotInventory.uuid; replace that with the actual snapshot UUID from
the pre-created DB record (e.g. the variable or object holding the snapshot's
real UUID or preCreatedSnapshot.getUuid()) so
VolumeSnapshotInventory.setUuid(...) contains the true UUID (not the
hypervisor-returned name), and ensure the same UUID is passed to
syncVolumeSnapshotDBAfterTakeSnapshot() and any subsequent DB lookups; locate
the assignment in the block creating VolumeSnapshotInventory and swap the source
to the real snapshot UUID instead of
TakeSnapshotOnHypervisorMsg.getSnapshotName().

Normalize SDN controller inventory endpoint host extraction, including bracketed IPv6 with ports and trimmed host values.

Validation: mvn -pl header -DskipTests -Djacoco.skip=true compile passed in prior reviewCommentFix run.

Resolves: ZCF-2747

Change-Id: I00939db3552081f4822762e138fb4652d3513f7c
@MatheMatrix MatheMatrix force-pushed the sync/shixin.ruan/shixin-ZCF-2747@@2 branch from aea6000 to 8495919 Compare May 20, 2026 11:13
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/src/test/groovy/org/zstack/test/unittest/network/sdncontroller/SdnControllerInventoryCase.groovy (1)

9-15: ⚡ Quick win

补充 null 与仅空白 endpoint 的回归断言。

当前用例已覆盖主要路径,但生产逻辑显式处理了 nulltrim,建议补两条断言,避免后续回归。

建议补充的测试片段
     void testInventoryIpStripsEndpointPort() {
         assert inventoryIp(" 10.0.0.10:8080 ") == "10.0.0.10"
         assert inventoryIp(" controller.example.com:6640 ") == "controller.example.com"
         assert inventoryIp(" [fe80::1]:6640 ") == "fe80::1"
         assert inventoryIp(" 2001:db8::1 ") == "2001:db8::1"
         assert inventoryIp(" controller.example.com ") == "controller.example.com"
+        assert inventoryIp(null) == null
+        assert inventoryIp("   ") == ""
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@test/src/test/groovy/org/zstack/test/unittest/network/sdncontroller/SdnControllerInventoryCase.groovy`
around lines 9 - 15, Update the testInventoryIpStripsEndpointPort test to add
regression assertions for null and all-whitespace endpoints: call
inventoryIp(null) and inventoryIp("   ") and assert they return null (matching
the production logic that handles null and trims input); place these two
assertions alongside the existing asserts in the
testInventoryIpStripsEndpointPort method so future changes won't regress
handling of null/blank endpoints.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In
`@test/src/test/groovy/org/zstack/test/unittest/network/sdncontroller/SdnControllerInventoryCase.groovy`:
- Around line 9-15: Update the testInventoryIpStripsEndpointPort test to add
regression assertions for null and all-whitespace endpoints: call
inventoryIp(null) and inventoryIp("   ") and assert they return null (matching
the production logic that handles null and trims input); place these two
assertions alongside the existing asserts in the
testInventoryIpStripsEndpointPort method so future changes won't regress
handling of null/blank endpoints.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 0370d145-1ab8-4d4f-88a6-8f208593de68

📥 Commits

Reviewing files that changed from the base of the PR and between aea6000 and 8495919.

📒 Files selected for processing (2)
  • header/src/main/java/org/zstack/header/network/sdncontroller/SdnControllerInventory.java
  • test/src/test/groovy/org/zstack/test/unittest/network/sdncontroller/SdnControllerInventoryCase.groovy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.