Skip to content

[ceph]: skip ceph ops on non-enabled cluster timeout#3947

Open
zstack-robot-1 wants to merge 2 commits into
5.5.22from
sync/songtao.liu/fix/ZSTAC-80275-v2
Open

[ceph]: skip ceph ops on non-enabled cluster timeout#3947
zstack-robot-1 wants to merge 2 commits into
5.5.22from
sync/songtao.liu/fix/ZSTAC-80275-v2

Conversation

@zstack-robot-1
Copy link
Copy Markdown
Collaborator

Resolves: ZSTAC-80275

sync from gitlab !9838

Skip Ceph secret creation when cluster is not Enabled to prevent reconnect timeout.
Add 5-minute await timeout to FutureCompletion to prevent permanent blocking.

Resolves: ZSTAC-80275

Change-Id: Iaa61109103cc5b4ee9bf86485e1777456b172b56
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 989af711-5f6b-4dc3-996c-3af88fce3ffc

📥 Commits

Reviewing files that changed from the base of the PR and between 2dd5d63 and 918a667.

📒 Files selected for processing (1)
  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephKvmExtension.java
🚧 Files skipped from review as they are similar to previous changes (1)
  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephKvmExtension.java

Walkthrough

在 connectionReestablished 中按主机 clusterUuid 查询 ClusterVO;若无记录或集群状态非 ClusterState.Enabled 则记录日志并返回;否则继续创建 Ceph 密钥,等待最多 5 分钟再检查结果。

变更说明

集群状态保护与超时管理

层 / 文件 说明
导入与 logger 初始化
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephKvmExtension.java
新增对 UtilsCLoggerClusterStateClusterVOTimeUnit 的导入,并初始化类级别 logger
ClusterVO 查询与状态校验
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephKvmExtension.java
connectionReestablished 中按主机的 clusterUuid 查询 ClusterVO;若集群不存在则 log.warn 并返回;若存在但状态非 ClusterState.Enabledlog.info 并返回,跳过后续 Ceph secret 创建。
密钥创建等待超时
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephKvmExtension.java
将等待密钥创建完成的调用从无超时的 completion.await() 改为 completion.await(TimeUnit.MINUTES.toMillis(5))(5 分钟超时)。

Sequence Diagram(s)

sequenceDiagram
  participant CephKvmExtension
  participant DB as ClusterVO(dbf)
  participant CephSecretCreator
  participant Completion
  CephKvmExtension->>DB: dbf.findByUuid(host.clusterUuid)
  alt ClusterVO not found
    CephKvmExtension->>CephKvmExtension: log.warn and return
  else ClusterVO found but state != ClusterState.Enabled
    CephKvmExtension->>CephKvmExtension: log.info and return
  else ClusterState.Enabled
    CephKvmExtension->>CephSecretCreator: invoke createCephSecret
    CephSecretCreator->>Completion: start and await(5 minutes)
    Completion->>CephKvmExtension: isSuccess / result
  end
Loading

代码审查工作量估计

🎯 3 (中等) | ⏱️ ~20 分钟

兔儿查集群,若未启便歇步,
密钥生事忙,五分钟做个度,
若无或停摆,日志轻声述,
小小超时定,流程更稳妥 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed 标题格式正确遵循 [scope]: ,长度54字符在72字符限制内,但对变更内容的描述不够清晰完整。
Description check ✅ Passed 描述虽简洁,但明确关联到JSTAC-80275工单和GitLab同步源(!9838),与变更内容相关。
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/songtao.liu/fix/ZSTAC-80275-v2

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephKvmExtension.java (1)

82-82: ⚡ Quick win

将超时时间提取为具名常量,避免魔法值。

Line 82 直接写 5 可读性偏弱,后续调整也容易遗漏;建议提取为类级常量。

♻️ 建议修改
 public class CephKvmExtension implements KVMHostConnectExtensionPoint, HostConnectionReestablishExtensionPoint {
     private static final CLogger logger = Utils.getLogger(CephKvmExtension.class);
+    private static final long SECRET_CREATION_AWAIT_TIMEOUT_MILLIS = TimeUnit.MINUTES.toMillis(5);

@@
-        completion.await(TimeUnit.MINUTES.toMillis(5));
+        completion.await(SECRET_CREATION_AWAIT_TIMEOUT_MILLIS);

As per coding guidelines "避免使用魔法值(Magic Value):直接使用未经定义的数值或字符串应替换为枚举或常量。"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephKvmExtension.java`
at line 82, Replace the magic literal 5 in the
completion.await(TimeUnit.MINUTES.toMillis(5)) call by introducing a descriptive
class-level constant (e.g., COMPLETION_AWAIT_TIMEOUT_MINUTES or TIMEOUT_MINUTES)
in CephKvmExtension and use that constant in the await call
(completion.await(TimeUnit.MINUTES.toMillis(COMPLETION_AWAIT_TIMEOUT_MINUTES)));
ensure the constant is static final and documented so future adjustments are
centralized.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In
`@plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephKvmExtension.java`:
- Line 82: Replace the magic literal 5 in the
completion.await(TimeUnit.MINUTES.toMillis(5)) call by introducing a descriptive
class-level constant (e.g., COMPLETION_AWAIT_TIMEOUT_MINUTES or TIMEOUT_MINUTES)
in CephKvmExtension and use that constant in the await call
(completion.await(TimeUnit.MINUTES.toMillis(COMPLETION_AWAIT_TIMEOUT_MINUTES)));
ensure the constant is static final and documented so future adjustments are
centralized.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 426ed645-091f-47b6-b49d-d76d35791157

📥 Commits

Reviewing files that changed from the base of the PR and between 57a496a and 2ababed.

📒 Files selected for processing (1)
  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephKvmExtension.java

@MatheMatrix
Copy link
Copy Markdown
Owner

Comment from yaohua.wu:

Review: MR !9838 — ZSTAC-80275

P1 — High

# File:Line 问题 修复建议
1 CephKvmExtension.java:58 cluster == null 时未提前返回,代码继续执行 createSecret。在集群被删除或数据异常的情况下(race condition:物理机重连回调触发时集群恰好被删除),findByUuid 返回 null,当前条件 cluster != null && cluster.getState() != Enabled 对 null 不拦截,直接落入后续逻辑,会引发难以定位的下游错误。 将 null 判断独立提前:if (cluster == null) { logger.warn(...); return; } 再判断 state,而不是将两个条件用 && 合并。

P2 — Moderate

# File:Line 问题 修复建议
1 CephKvmExtension.java:59 注释 // skip Ceph storage operations if cluster is not Enabled (ZSTAC-80275) 以 ticket key 作为唯一说明,缺少独立技术描述。 改为描述约束本身,例如 // cluster disabled: skip Ceph secret creation to avoid timeout,ticket key 可作为补充但不应是核心内容。

总结

本 MR 针对 ZSTAC-80275(集群停用状态下重连 Ceph 主存储超时)做了两处修复:① 在 connectionReestablished 入口增加集群状态守卫,非 Enabled 状态直接跳过;② 将 completion.await() 无限等待改为 5 分钟超时。超时修复正确覆盖了 Jira 描述的根因;集群状态守卫逻辑正确,但 cluster == null 的边界情况需要独立拦截,避免数据异常时继续执行引发误导性错误。

Verdict: REVISION_REQUIRED


🤖 Robot Reviewer

@MatheMatrix MatheMatrix force-pushed the sync/songtao.liu/fix/ZSTAC-80275-v2 branch 2 times, most recently from 7139ea4 to 2dd5d63 Compare May 19, 2026 08:16
P1: Separate null check from state check. When findByUuid returns null
(cluster deleted during race with reconnect callback), the original
`cluster != null && cluster.getState() != Enabled` short-circuits false
and falls through to createSecret(), causing downstream errors. Now null
cluster triggers early return with warn-level log.

P2: Improve comment to describe the constraint technically instead of
only referencing the ticket key.

Resolves: ZSTAC-80275
@MatheMatrix MatheMatrix force-pushed the sync/songtao.liu/fix/ZSTAC-80275-v2 branch from 2dd5d63 to 918a667 Compare May 19, 2026 11:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants