fix(lvg-conditions-watcher): atomically update Ready condition and phase to prevent stuck NotReady by AleksZimin · Pull Request #245 · deckhouse/sds-node-configurator

AleksZimin · 2026-04-22T08:32:29Z

Description

Why do we need it, and what problem does it solve?

What is the expected result?

Checklist

The code is covered by unit tests.
e2e tests passed.
Documentation updated according to the changes.
Changes were tested in the Kubernetes cluster manually.

LVG status.phase can get stuck in NotReady even when every condition is True because reconcileLVGConditions issued two sequential Status().Update() calls (one for the Ready condition, one for the phase) and a concurrent writer could bump resourceVersion between them, making the second call fail with 409 Conflict. The happy-path branch only logged that error, so Status.Conditions stayed consistent, reflect.DeepEqual(...) in the watcher's UpdateFunc swallowed subsequent events, and the phase never converged. Fix: derive the desired (phase, Ready status/reason/message) via a pure decideLVGReadyAndPhase function and apply both mutations with a single Status().Update() in updateLVGReadyConditionAndPhaseIfNeeded. Any error, including 409, is returned to the reconciler so controller-runtime can requeue and re-read a fresh object. Tests: - table-driven unit tests for decideLVGReadyAndPhase covering missing conditions, Creating/Terminating short-circuits, non-acceptable False reasons, acceptable False reasons (Updating/ValidationFailed), happy path, and the Ready condition being ignored in its own evaluation. - integration tests for updateLVGReadyConditionAndPhaseIfNeeded on a fake client with WithStatusSubresource(LVMVolumeGroup): verify the helper performs exactly one Status().Update() (observed via resourceVersion), is a no-op when both fields already match, and does not duplicate the Ready condition when mutating an existing one. Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>

Previously in CI we set suite.FailFast=true, which means the first failing spec aborts the suite and later specs (including potentially unrelated regressions and the cluster teardown-side paths) never run. That hides downstream failures and makes diagnosing multi-failure runs expensive. Changes: * always FailFast=false — every spec runs to completion, even when an earlier spec has already failed. go test still exits non-zero when any spec failed, so CI remains honest. * new ReportAfterSuite writes a JUnit XML (e2e-junit-report.xml) and a JSON report (e2e-report.json) into ${E2E_REPORT_DIR:-.}, and prints an explicit "E2E SUITE SUMMARY / FAILED SPEC(S)" block to GinkgoWriter listing every failed spec (full text + location + first-line message). This guarantees the failure list is visible in the raw test log even when artifact upload does not happen. * build_dev.yml: set E2E_REPORT_DIR=\${{ github.workspace }}/e2e so the reports land next to e2e-test-output.log and add an Upload test reports step that publishes both files as a separate artifact. Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>

Local helper that runs the exact same Go lint flow as the deckhouse/modules-actions/go_linter@v12 action used in go_checks.yaml: * installs a pinned golangci-lint (v2.9.0 by default, overridable via GOLANGCI_LINT_VERSION) into hack/.bin so the local version never drifts from CI; * iterates every go.mod under images/ and every build tag in GO_BUILD_TAGS ("ce ee se seplus csepro"), running `golangci-lint run --allow-parallel-runners --build-tags <edition>`; * keeps going after individual failures and exits non-zero if any run failed. Adds two extras for local use: * --fix / --fix-only to apply `golangci-lint fmt` (gci, gofmt, goimports, ...) before running linters; * --module / --edition to narrow the matrix to a single module/tag when iterating on a specific file. hack/.bin/ (cached golangci-lint binary) is added to .gitignore. Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>

github-actions · 2026-04-22T09:48:36Z

E2E Smoke Tests Results ❌

Module: sds-node-configurator
Framework: storage-e2e
Namespace: e2e-sds-node-configurator-pr245-24769936987
Status: Tests failed

View Details

Workflow: Build and checks
Run ID: 24769936987
Commit: 8619d0c
PR: fix(lvg-conditions-watcher): atomically update Ready condition and phase to prevent stuck NotReady #245
Triggered by: e2e-smoke-test label
Exit Code: 1

Tests were executed via storage-e2e (TestSdsNodeConfigurator: Common Scheduler Extender, then Sds Node Configurator).

Artifacts: Test logs are available in workflow artifacts.

…bled On RedOS 8 multipathd.service is shipped enabled but inactive by default, so `systemctl is-enabled --quiet multipathd` returned 0 while the unit was not running. The handler then called `systemctl reload multipathd`, which prints "multipathd.service is not active, cannot reload." and exits non-zero. That non-zero status propagated out of the bb-sync-file-changed event handler and bashible retried the whole 100_sds-node-configurator-add-loop-devices-to-blacklist.sh step every 10 seconds indefinitely. Switch the predicate to `systemctl is-active`, which is what reload actually requires. If multipathd is stopped there is nothing to reload — it will pick up the new /etc/multipath/conf.d/loop-blacklist.conf at its next start. Absent, masked and disabled units behave the same as before (is-active returns non-zero, handler does nothing). Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>

AleksZimin self-assigned this Apr 22, 2026

AleksZimin changed the title ~~fix(controller): LVMVolumeGroup stuck in NotReady despite all conditions True~~ fix(lvg-conditions-watcher): atomically update Ready condition and phase to prevent stuck NotReady Apr 22, 2026

AleksZimin added 3 commits April 22, 2026 11:55

AleksZimin force-pushed the fix/lvg-phase-stuck-notready branch from 6d58865 to e6c3a6d Compare April 22, 2026 08:57

AleksZimin added the e2e-smoke-test label Apr 22, 2026

github-actions Bot removed the e2e-smoke-test label Apr 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(lvg-conditions-watcher): atomically update Ready condition and phase to prevent stuck NotReady#245

fix(lvg-conditions-watcher): atomically update Ready condition and phase to prevent stuck NotReady#245
AleksZimin wants to merge 4 commits into
mainfrom
fix/lvg-phase-stuck-notready

AleksZimin commented Apr 22, 2026

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

AleksZimin commented Apr 22, 2026

Description

Why do we need it, and what problem does it solve?

What is the expected result?

Checklist

Uh oh!

github-actions Bot commented Apr 22, 2026

E2E Smoke Tests Results ❌

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant